Identifying who said what transforms raw transcripts into actionable meeting notes. Multi-speaker transcription builds on speaker diarization: separating overlapping voices, assigning speaker labels, and maintaining conversation flow. This guide demonstrates how to implement robust speaker identification for meeting bots using modern speech recognition APIs and custom processing techniques.
Understanding Speaker Diarization
Speaker diarization answers “who spoke when” without knowing speakers’ identities beforehand. The system analyzes voice characteristics—pitch, tone, speaking rate—to cluster audio segments by speaker. You then map these clusters to actual participants using metadata from the meeting platform.
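Under the hood, most diarization systems embed short audio frames into feature vectors and cluster them by similarity. A toy sketch of that clustering step, using made-up two-dimensional "voice features" in place of real learned embeddings, might look like this:

```python
import numpy as np

def cluster_by_voice_features(frames, threshold=1.0):
    """Greedy clustering: assign each frame to the nearest existing
    cluster centroid, or start a new cluster if none is close enough."""
    centroids, labels = [], []
    for frame in frames:
        if centroids:
            dists = [np.linalg.norm(frame - c) for c in centroids]
            best = int(np.argmin(dists))
        if not centroids or dists[best] > threshold:
            centroids.append(frame.astype(float))
            labels.append(len(centroids) - 1)
        else:
            labels.append(best)
    return labels

# Two synthetic "speakers": frames near (0, 0) and frames near (5, 5)
frames = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9], [0.05, 0.05]])
print(cluster_by_voice_features(frames))  # [0, 0, 1, 1, 0]
```

Real systems use neural speaker embeddings and far more sophisticated clustering, but the principle is the same: segments with similar voice characteristics end up under the same anonymous label.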
Choosing the Right Transcription Service
Modern APIs handle diarization automatically. AssemblyAI, Deepgram, and Azure Speech Services all support speaker separation:
```python
import os

import assemblyai as aai
from dotenv import load_dotenv

load_dotenv()

class SpeakerDiarizationService:
    def __init__(self):
        aai.settings.api_key = os.getenv("ASSEMBLYAI_API_KEY")

    def transcribe_with_speakers(self, audio_file, expected_speakers=None):
        """Transcribe audio with speaker identification."""
        config = aai.TranscriptionConfig(
            speaker_labels=True,
            speakers_expected=expected_speakers,  # Optional hint
            language_code="en_us"
        )
        transcriber = aai.Transcriber()
        transcript = transcriber.transcribe(audio_file, config=config)
        if transcript.status == aai.TranscriptStatus.error:
            raise Exception(f"Transcription failed: {transcript.error}")
        return transcript

    def extract_speaker_segments(self, transcript):
        """Extract timestamped speaker segments."""
        segments = []
        for utterance in transcript.utterances:
            segment = {
                'speaker': utterance.speaker,
                'start': utterance.start / 1000,  # Convert ms to seconds
                'end': utterance.end / 1000,
                'text': utterance.text,
                'confidence': utterance.confidence
            }
            segments.append(segment)
        return segments
```
Mapping Anonymous Speakers to Real Participants
APIs return generic labels like “Speaker A” and “Speaker B”. Map these to actual participants:
```python
class SpeakerMapper:
    def __init__(self):
        self.speaker_map = {}
        self.voice_profiles = {}

    def create_participant_list(self, meeting_participants):
        """Initialize mapping from meeting platform data."""
        self.participants = {
            p['id']: {
                'name': p['name'],
                'email': p.get('email'),
                'role': p.get('role', 'participant')
            }
            for p in meeting_participants
        }

    def map_speaker_to_participant(self, speaker_label, participant_id):
        """Manually map a speaker label to a participant."""
        self.speaker_map[speaker_label] = participant_id
        print(f"Mapped {speaker_label} -> {self.participants[participant_id]['name']}")

    def auto_map_speakers(self, segments, participant_join_times):
        """Automatically map speakers based on join times."""
        # Group segments by speaker
        speaker_segments = {}
        for segment in segments:
            speaker = segment['speaker']
            if speaker not in speaker_segments:
                speaker_segments[speaker] = []
            speaker_segments[speaker].append(segment)
        # Match speakers to participants by timing
        for speaker, segs in speaker_segments.items():
            first_speech_time = min(s['start'] for s in segs)
            # Find the participant who joined closest before the first speech
            best_match = None
            min_diff = float('inf')
            for participant_id, join_time in participant_join_times.items():
                if join_time <= first_speech_time:
                    diff = first_speech_time - join_time
                    if diff < min_diff:
                        min_diff = diff
                        best_match = participant_id
            if best_match:
                self.map_speaker_to_participant(speaker, best_match)

    def get_participant_name(self, speaker_label):
        """Get a participant name from a speaker label."""
        if speaker_label in self.speaker_map:
            participant_id = self.speaker_map[speaker_label]
            return self.participants[participant_id]['name']
        return f"Unknown ({speaker_label})"

    def apply_mapping(self, segments):
        """Apply speaker mapping to transcript segments."""
        mapped_segments = []
        for segment in segments:
            mapped_segment = segment.copy()
            mapped_segment['participant'] = self.get_participant_name(
                segment['speaker']
            )
            mapped_segments.append(mapped_segment)
        return mapped_segments
```
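The join-time heuristic in auto_map_speakers can be exercised in isolation. A stripped-down version of the same matching logic, with plain dicts instead of class state, behaves like this:

```python
def match_speakers_to_joins(first_speech_times, join_times):
    """For each speaker, pick the participant who joined most recently
    before that speaker's first utterance."""
    mapping = {}
    for speaker, first_time in first_speech_times.items():
        best, best_diff = None, float('inf')
        for participant, join_time in join_times.items():
            if join_time <= first_time and first_time - join_time < best_diff:
                best, best_diff = participant, first_time - join_time
        if best is not None:
            mapping[speaker] = best
    return mapping

first_speech = {'A': 2.0, 'B': 18.5}   # seconds from meeting start
joins = {'alice': 0, 'bob': 15}
print(match_speakers_to_joins(first_speech, joins))  # {'A': 'alice', 'B': 'bob'}
```

Note the heuristic's limits: if the first joiner stays silent while a later joiner speaks first, or two people join at nearly the same time, it can mis-assign labels, which is why the manual map_speaker_to_participant escape hatch matters.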
Handling Overlapping Speech
Multiple people talking simultaneously creates transcription challenges. Implement overlap detection:
```python
import numpy as np  # used by the speaker verification section below

class OverlapHandler:
    def __init__(self, overlap_threshold=0.3):
        self.overlap_threshold = overlap_threshold

    def detect_overlaps(self, segments):
        """Identify overlapping speech segments."""
        overlaps = []
        for i, seg1 in enumerate(segments):
            for seg2 in segments[i+1:]:
                overlap_start = max(seg1['start'], seg2['start'])
                overlap_end = min(seg1['end'], seg2['end'])
                if overlap_end > overlap_start:
                    overlap_duration = overlap_end - overlap_start
                    seg1_duration = seg1['end'] - seg1['start']
                    # Calculate overlap percentage
                    overlap_pct = overlap_duration / seg1_duration
                    if overlap_pct >= self.overlap_threshold:
                        overlaps.append({
                            'speakers': [seg1['speaker'], seg2['speaker']],
                            'start': overlap_start,
                            'end': overlap_end,
                            'duration': overlap_duration
                        })
        return overlaps

    def merge_overlapping_segments(self, segments, overlaps):
        """Mark segments involved in significant overlaps."""
        processed_segments = []
        for segment in segments:
            # Check if this segment has significant overlap
            has_overlap = False
            for overlap in overlaps:
                if segment['speaker'] in overlap['speakers']:
                    has_overlap = True
                    break
            if has_overlap:
                # Flag as overlapping speech
                segment['overlap'] = True
                segment['text'] = f"[Overlap] {segment['text']}"
            processed_segments.append(segment)
        return processed_segments
```
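The interval math at the heart of detect_overlaps is just intersection of time ranges, which is easy to sanity-check on its own:

```python
def interval_overlap(a_start, a_end, b_start, b_end):
    """Return the overlap duration between two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

print(interval_overlap(0.0, 5.0, 3.0, 8.0))  # 2.0 (intersection is 3.0..5.0)
print(interval_overlap(0.0, 2.0, 3.0, 4.0))  # 0.0 (disjoint)
```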
Real-Time Speaker Tracking
For live transcription, track speakers in real-time:
```python
from collections import deque
import threading

class RealtimeSpeakerTracker:
    def __init__(self, window_size=5):
        self.current_speaker = None
        self.last_speaker_start = None
        self.speaker_history = deque(maxlen=window_size)
        self.speaker_durations = {}
        self.lock = threading.Lock()

    def update_speaker(self, speaker_label, timestamp):
        """Update the current speaker and track speaking time."""
        with self.lock:
            if self.current_speaker != speaker_label:
                # Speaker changed: close out the previous speaker's turn
                if self.current_speaker:
                    duration = timestamp - self.last_speaker_start
                    if self.current_speaker not in self.speaker_durations:
                        self.speaker_durations[self.current_speaker] = 0
                    self.speaker_durations[self.current_speaker] += duration
                self.current_speaker = speaker_label
                self.last_speaker_start = timestamp
            self.speaker_history.append((speaker_label, timestamp))

    def get_speaking_stats(self):
        """Calculate speaking time statistics."""
        total_time = sum(self.speaker_durations.values())
        stats = {}
        for speaker, duration in self.speaker_durations.items():
            percentage = (duration / total_time * 100) if total_time > 0 else 0
            stats[speaker] = {
                'duration': duration,
                'percentage': round(percentage, 2),
                'turns': self._count_speaking_turns(speaker)
            }
        return stats

    def _count_speaking_turns(self, speaker):
        """Count the number of times a speaker took the floor."""
        turns = 0
        prev_speaker = None
        for hist_speaker, _ in self.speaker_history:
            if hist_speaker != prev_speaker and hist_speaker == speaker:
                turns += 1
            prev_speaker = hist_speaker
        return turns

    def get_dominant_speaker(self):
        """Identify who spoke the most."""
        if not self.speaker_durations:
            return None
        return max(self.speaker_durations.items(), key=lambda x: x[1])[0]
```
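The turn-counting logic is worth verifying separately, since off-by-one mistakes are easy here. The same algorithm as _count_speaking_turns, as a standalone function over an ordered history:

```python
def count_turns(history, speaker):
    """Count how many times `speaker` takes the floor in an ordered
    (speaker, timestamp) history; consecutive entries count as one turn."""
    turns, prev = 0, None
    for current, _ in history:
        if current == speaker and current != prev:
            turns += 1
        prev = current
    return turns

history = [('A', 0), ('A', 2), ('B', 5), ('A', 9), ('B', 12)]
print(count_turns(history, 'A'))  # 2 (the back-to-back entries at 0 and 2 are one turn)
print(count_turns(history, 'B'))  # 2
```

Keep in mind that the tracker's deque with maxlen=window_size only retains the most recent entries, so turn counts are over that sliding window, not the whole meeting.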
Advanced Speaker Features
Implement speaker verification to improve accuracy:
```python
import hashlib

import numpy as np

class SpeakerVerification:
    def __init__(self):
        self.voice_embeddings = {}

    def create_voice_embedding(self, audio_segment, speaker_id):
        """Create a unique voice fingerprint (simplified)."""
        # In production, use deep learning models such as x-vectors.
        # This simplified example uses basic audio statistics.
        audio_array = np.frombuffer(audio_segment, dtype=np.int16)
        # Extract basic voice features
        features = {
            'mean': np.mean(audio_array),
            'std': np.std(audio_array),
            'max': np.max(audio_array),
            'min': np.min(audio_array)
        }
        # Create a simple hash as the embedding
        feature_str = str(sorted(features.items()))
        embedding = hashlib.md5(feature_str.encode()).hexdigest()
        self.voice_embeddings[speaker_id] = embedding
        return embedding

    def verify_speaker(self, audio_segment, claimed_speaker_id):
        """Verify whether audio matches a known speaker."""
        if claimed_speaker_id not in self.voice_embeddings:
            return False
        current_embedding = self.create_voice_embedding(
            audio_segment,
            "temp"
        )
        stored_embedding = self.voice_embeddings[claimed_speaker_id]
        # Exact-hash comparison only matches identical audio;
        # use cosine similarity over real embeddings in production
        return current_embedding == stored_embedding
```
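The comments above point at the production approach: real-valued embedding vectors compared with cosine similarity rather than exact hashes. A minimal sketch, with hard-coded vectors standing in for learned x-vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(embedding, enrolled, threshold=0.8):
    """Accept the claimed speaker if similarity exceeds the threshold."""
    return cosine_similarity(embedding, enrolled) >= threshold

enrolled = np.array([0.9, 0.1, 0.3])
same_speaker = np.array([0.85, 0.15, 0.32])   # close in direction
other_speaker = np.array([0.1, 0.9, -0.4])    # very different
print(verify(same_speaker, enrolled))   # True
print(verify(other_speaker, enrolled))  # False
```

The threshold value here is illustrative; in practice it is tuned on labeled data to balance false accepts against false rejects.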
Building the Complete Multi-Speaker System
Integrate all components:
```python
class MultiSpeakerTranscriptionSystem:
    def __init__(self):
        self.diarization = SpeakerDiarizationService()
        self.mapper = SpeakerMapper()
        self.overlap_handler = OverlapHandler()
        self.tracker = RealtimeSpeakerTracker()

    def process_meeting(self, audio_file, participants, join_times):
        """Complete multi-speaker transcription pipeline."""
        print("Starting multi-speaker transcription...")
        # Step 1: Transcribe with speaker diarization
        transcript = self.diarization.transcribe_with_speakers(
            audio_file,
            expected_speakers=len(participants)
        )
        # Step 2: Extract speaker segments
        segments = self.diarization.extract_speaker_segments(transcript)
        print(f"Extracted {len(segments)} speaker segments")
        # Step 3: Map speakers to participants
        self.mapper.create_participant_list(participants)
        self.mapper.auto_map_speakers(segments, join_times)
        # Step 4: Apply mapping
        mapped_segments = self.mapper.apply_mapping(segments)
        # Step 5: Handle overlaps
        overlaps = self.overlap_handler.detect_overlaps(mapped_segments)
        final_segments = self.overlap_handler.merge_overlapping_segments(
            mapped_segments,
            overlaps
        )
        print(f"Detected {len(overlaps)} overlapping speech instances")
        return final_segments

    def format_transcript(self, segments):
        """Format a multi-speaker transcript for display."""
        output = []
        output.append("MULTI-SPEAKER MEETING TRANSCRIPT")
        output.append("=" * 70)
        output.append("")
        for segment in segments:
            timestamp = self._format_time(segment['start'])
            participant = segment.get('participant', segment['speaker'])
            text = segment['text']
            output.append(f"[{timestamp}] {participant}:")
            output.append(f"  {text}")
            output.append("")
        return "\n".join(output)

    def _format_time(self, seconds):
        """Format seconds as MM:SS."""
        minutes = int(seconds // 60)
        secs = int(seconds % 60)
        return f"{minutes:02d}:{secs:02d}"

    def generate_speaking_stats(self, segments):
        """Generate a speaking statistics report."""
        stats = {}
        for segment in segments:
            participant = segment.get('participant', segment['speaker'])
            duration = segment['end'] - segment['start']
            if participant not in stats:
                stats[participant] = {'duration': 0, 'turns': 0}
            stats[participant]['duration'] += duration
            stats[participant]['turns'] += 1
        # Calculate percentages
        total_duration = sum(s['duration'] for s in stats.values())
        for participant in stats:
            percentage = (stats[participant]['duration'] / total_duration * 100)
            stats[participant]['percentage'] = round(percentage, 2)
        return stats

    def save_transcript(self, segments, filename="transcript.txt"):
        """Save the formatted transcript with statistics."""
        formatted = self.format_transcript(segments)
        stats = self.generate_speaking_stats(segments)
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(formatted)
            f.write("\n" + "=" * 70 + "\n")
            f.write("SPEAKING STATISTICS\n")
            f.write("=" * 70 + "\n\n")
            for participant, data in sorted(
                stats.items(),
                key=lambda x: x[1]['duration'],
                reverse=True
            ):
                f.write(f"{participant}:\n")
                f.write(f"  Speaking time: {data['duration']:.1f}s ({data['percentage']}%)\n")
                f.write(f"  Speaking turns: {data['turns']}\n\n")
        print(f"Transcript saved: {filename}")

# Usage example
if __name__ == "__main__":
    system = MultiSpeakerTranscriptionSystem()
    # Define meeting participants
    participants = [
        {'id': '1', 'name': 'Alice Johnson', 'email': 'alice@company.com'},
        {'id': '2', 'name': 'Bob Smith', 'email': 'bob@company.com'},
        {'id': '3', 'name': 'Carol White', 'email': 'carol@company.com'}
    ]
    # Participant join times (seconds from meeting start)
    join_times = {
        '1': 0,
        '2': 15,
        '3': 30
    }
    # Process the meeting
    segments = system.process_meeting(
        "meeting_audio.wav",
        participants,
        join_times
    )
    # Save results
    system.save_transcript(segments, "meeting_transcript.txt")
```
Optimization Tips
Use speaker count hints to improve accuracy—telling the API how many speakers to expect reduces confusion. Pre-process audio to remove silence and background noise. Cache voice profiles for recurring participants to speed up identification in future meetings.
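The pre-processing step mentioned above can be as simple as energy-based silence trimming on raw PCM samples. A rough sketch, assuming 16-bit mono audio and an illustrative amplitude threshold:

```python
import numpy as np

def trim_silence(samples, threshold=500, frame_size=4):
    """Drop frames whose mean absolute amplitude falls below a threshold."""
    kept = []
    for i in range(0, len(samples), frame_size):
        frame = samples[i:i + frame_size]
        if np.mean(np.abs(frame)) >= threshold:
            kept.append(frame)
    return np.concatenate(kept) if kept else np.array([], dtype=samples.dtype)

# Quiet samples surround a loud burst of speech
audio = np.array([0, 10, -5, 3, 8000, -7500, 6000, -9000, 2, -1, 4, 0], dtype=np.int16)
print(trim_silence(audio).tolist())  # [8000, -7500, 6000, -9000]
```

Production pipelines would use a proper voice activity detector and much larger frame sizes (for example 10 to 30 ms of samples per frame), but the energy-gating idea is the same.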
Monitor confidence scores. Low confidence indicates challenging audio or speaker confusion. Flag these segments for manual review in critical applications.
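Flagging low-confidence segments is a one-line filter over the segment dicts produced earlier. A sketch, assuming the 0-to-1 confidence field carried through from the utterances:

```python
def flag_for_review(segments, min_confidence=0.7):
    """Return segments whose transcription confidence falls below the cutoff."""
    return [s for s in segments if s.get('confidence', 1.0) < min_confidence]

segments = [
    {'speaker': 'A', 'text': 'Clear audio here', 'confidence': 0.95},
    {'speaker': 'B', 'text': 'Mumbled crosstalk', 'confidence': 0.42},
]
print([s['speaker'] for s in flag_for_review(segments)])  # ['B']
```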
Your multi-speaker transcription system now accurately identifies speakers, handles overlapping speech, generates participation statistics, and produces professional meeting transcripts with clear speaker attribution.
Conclusion
Implementing robust multi-speaker transcription requires combining API-based diarization with intelligent speaker mapping, overlap handling, and real-time tracking to create accurate, actionable meeting transcripts.
If you want production-ready multi-speaker transcription without building complex systems, consider Meetstream.ai API, which automatically handles speaker identification and participant mapping across all major meeting platforms.