Identifying who said what transforms raw transcripts into actionable meeting notes. Multi-speaker transcription builds on speaker diarization: separating overlapping voices, assigning speaker labels, and maintaining conversation flow. This guide demonstrates how to implement robust speaker identification for meeting bots using modern speech recognition APIs and custom processing techniques.
Understanding Speaker Diarization
Speaker diarization answers “who spoke when” without knowing speakers’ identities beforehand. The system analyzes voice characteristics—pitch, tone, speaking rate—to cluster audio segments by speaker. You then map these clusters to actual participants using metadata from the meeting platform.
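Under the hood, most diarization systems embed short audio frames into feature vectors and cluster them by similarity. A toy sketch of that clustering step, using made-up two-dimensional "voice features" in place of real learned embeddings, might look like this:

```python
import numpy as np

def cluster_by_voice_features(frames, threshold=1.0):
    """Greedy clustering: assign each frame to the nearest existing
    cluster centroid, or start a new cluster if none is close enough."""
    centroids, labels = [], []
    for frame in frames:
        if centroids:
            dists = [np.linalg.norm(frame - c) for c in centroids]
            best = int(np.argmin(dists))
        if not centroids or dists[best] > threshold:
            centroids.append(frame.astype(float))
            labels.append(len(centroids) - 1)
        else:
            labels.append(best)
    return labels

# Two synthetic "speakers": frames near (0, 0) and frames near (5, 5)
frames = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9], [0.05, 0.05]])
print(cluster_by_voice_features(frames))  # [0, 0, 1, 1, 0]
```

Real systems use neural speaker embeddings and far more sophisticated clustering, but the principle is the same: segments with similar voice characteristics end up under the same anonymous label.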
Choosing the Right Transcription Service
Modern APIs handle diarization automatically. AssemblyAI, Deepgram, and Azure Speech Services all support speaker separation:
```python
import os

import assemblyai as aai
from dotenv import load_dotenv

load_dotenv()

class SpeakerDiarizationService:
    def __init__(self):
        aai.settings.api_key = os.getenv("ASSEMBLYAI_API_KEY")

    def transcribe_with_speakers(self, audio_file, expected_speakers=None):
        """Transcribe audio with speaker identification."""
        config = aai.TranscriptionConfig(
            speaker_labels=True,
            speakers_expected=expected_speakers,  # Optional hint
            language_code="en_us"
        )
        transcriber = aai.Transcriber()
        transcript = transcriber.transcribe(audio_file, config=config)
        if transcript.status == aai.TranscriptStatus.error:
            raise Exception(f"Transcription failed: {transcript.error}")
        return transcript

    def extract_speaker_segments(self, transcript):
        """Extract timestamped speaker segments."""
        segments = []
        for utterance in transcript.utterances:
            segment = {
                'speaker': utterance.speaker,
                'start': utterance.start / 1000,  # Convert ms to seconds
                'end': utterance.end / 1000,
                'text': utterance.text,
                'confidence': utterance.confidence
            }
            segments.append(segment)
        return segments
```
Mapping Anonymous Speakers to Real Participants
APIs return generic labels like “Speaker A” and “Speaker B”. Map these to actual participants:
```python
class SpeakerMapper:
    def __init__(self):
        self.speaker_map = {}
        self.voice_profiles = {}

    def create_participant_list(self, meeting_participants):
        """Initialize mapping from meeting platform data."""
        self.participants = {
            p['id']: {
                'name': p['name'],
                'email': p.get('email'),
                'role': p.get('role', 'participant')
            }
            for p in meeting_participants
        }

    def map_speaker_to_participant(self, speaker_label, participant_id):
        """Manually map a speaker label to a participant."""
        self.speaker_map[speaker_label] = participant_id
        print(f"Mapped {speaker_label} -> {self.participants[participant_id]['name']}")

    def auto_map_speakers(self, segments, participant_join_times):
        """Automatically map speakers based on join times."""
        # Group segments by speaker
        speaker_segments = {}
        for segment in segments:
            speaker = segment['speaker']
            if speaker not in speaker_segments:
                speaker_segments[speaker] = []
            speaker_segments[speaker].append(segment)
        # Match speakers to participants by timing
        for speaker, segs in speaker_segments.items():
            first_speech_time = min(s['start'] for s in segs)
            # Find the participant who joined closest before the first speech
            best_match = None
            min_diff = float('inf')
            for participant_id, join_time in participant_join_times.items():
                if join_time <= first_speech_time:
                    diff = first_speech_time - join_time
                    if diff < min_diff:
                        min_diff = diff
                        best_match = participant_id
            if best_match:
                self.map_speaker_to_participant(speaker, best_match)

    def get_participant_name(self, speaker_label):
        """Get a participant name from a speaker label."""
        if speaker_label in self.speaker_map:
            participant_id = self.speaker_map[speaker_label]
            return self.participants[participant_id]['name']
        return f"Unknown ({speaker_label})"

    def apply_mapping(self, segments):
        """Apply speaker mapping to transcript segments."""
        mapped_segments = []
        for segment in segments:
            mapped_segment = segment.copy()
            mapped_segment['participant'] = self.get_participant_name(
                segment['speaker']
            )
            mapped_segments.append(mapped_segment)
        return mapped_segments
```
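The join-time heuristic in auto_map_speakers can be exercised in isolation. A stripped-down version of the same matching logic, with plain dicts instead of class state, behaves like this:

```python
def match_speakers_to_joins(first_speech_times, join_times):
    """For each speaker, pick the participant who joined most recently
    before that speaker's first utterance."""
    mapping = {}
    for speaker, first_time in first_speech_times.items():
        best, best_diff = None, float('inf')
        for participant, join_time in join_times.items():
            if join_time <= first_time and first_time - join_time < best_diff:
                best, best_diff = participant, first_time - join_time
        if best is not None:
            mapping[speaker] = best
    return mapping

first_speech = {'A': 2.0, 'B': 18.5}   # seconds from meeting start
joins = {'alice': 0, 'bob': 15}
print(match_speakers_to_joins(first_speech, joins))  # {'A': 'alice', 'B': 'bob'}
```

Note the heuristic's limits: if the first joiner stays silent while a later joiner speaks first, or two people join at nearly the same time, it can mis-assign labels, which is why the manual map_speaker_to_participant escape hatch matters.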
Handling Overlapping Speech
Multiple people talking simultaneously creates transcription challenges. Implement overlap detection:
```python
import numpy as np  # used by the speaker verification section below

class OverlapHandler:
    def __init__(self, overlap_threshold=0.3):
        self.overlap_threshold = overlap_threshold

    def detect_overlaps(self, segments):
        """Identify overlapping speech segments."""
        overlaps = []
        for i, seg1 in enumerate(segments):
            for seg2 in segments[i+1:]:
                overlap_start = max(seg1['start'], seg2['start'])
                overlap_end = min(seg1['end'], seg2['end'])
                if overlap_end > overlap_start:
                    overlap_duration = overlap_end - overlap_start
                    seg1_duration = seg1['end'] - seg1['start']
                    # Calculate overlap percentage
                    overlap_pct = overlap_duration / seg1_duration
                    if overlap_pct >= self.overlap_threshold:
                        overlaps.append({
                            'speakers': [seg1['speaker'], seg2['speaker']],
                            'start': overlap_start,
                            'end': overlap_end,
                            'duration': overlap_duration
                        })
        return overlaps

    def merge_overlapping_segments(self, segments, overlaps):
        """Mark segments involved in significant overlaps."""
        processed_segments = []
        for segment in segments:
            # Check if this segment has significant overlap
            has_overlap = False
            for overlap in overlaps:
                if segment['speaker'] in overlap['speakers']:
                    has_overlap = True
                    break
            if has_overlap:
                # Flag as overlapping speech
                segment['overlap'] = True
                segment['text'] = f"[Overlap] {segment['text']}"
            processed_segments.append(segment)
        return processed_segments
```
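The interval math at the heart of detect_overlaps is just intersection of time ranges, which is easy to sanity-check on its own:

```python
def interval_overlap(a_start, a_end, b_start, b_end):
    """Return the overlap duration between two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

print(interval_overlap(0.0, 5.0, 3.0, 8.0))  # 2.0 (intersection is 3.0..5.0)
print(interval_overlap(0.0, 2.0, 3.0, 4.0))  # 0.0 (disjoint)
```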
Real-Time Speaker Tracking
For live transcription, track speakers in real-time:
```python
from collections import deque
import threading

class RealtimeSpeakerTracker:
    def __init__(self, window_size=5):
        self.current_speaker = None
        self.last_speaker_start = None
        self.speaker_history = deque(maxlen=window_size)
        self.speaker_durations = {}
        self.lock = threading.Lock()

    def update_speaker(self, speaker_label, timestamp):
        """Update the current speaker and track speaking time."""
        with self.lock:
            if self.current_speaker != speaker_label:
                # Speaker changed: close out the previous speaker's turn
                if self.current_speaker:
                    duration = timestamp - self.last_speaker_start
                    if self.current_speaker not in self.speaker_durations:
                        self.speaker_durations[self.current_speaker] = 0
                    self.speaker_durations[self.current_speaker] += duration
                self.current_speaker = speaker_label
                self.last_speaker_start = timestamp
            self.speaker_history.append((speaker_label, timestamp))

    def get_speaking_stats(self):
        """Calculate speaking time statistics."""
        total_time = sum(self.speaker_durations.values())
        stats = {}
        for speaker, duration in self.speaker_durations.items():
            percentage = (duration / total_time * 100) if total_time > 0 else 0
            stats[speaker] = {
                'duration': duration,
                'percentage': round(percentage, 2),
                'turns': self._count_speaking_turns(speaker)
            }
        return stats

    def _count_speaking_turns(self, speaker):
        """Count the number of times a speaker took the floor."""
        turns = 0
        prev_speaker = None
        for hist_speaker, _ in self.speaker_history:
            if hist_speaker != prev_speaker and hist_speaker == speaker:
                turns += 1
            prev_speaker = hist_speaker
        return turns

    def get_dominant_speaker(self):
        """Identify who spoke the most."""
        if not self.speaker_durations:
            return None
        return max(self.speaker_durations.items(), key=lambda x: x[1])[0]
```
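The turn-counting logic is worth verifying separately, since off-by-one mistakes are easy here. The same algorithm as _count_speaking_turns, as a standalone function over an ordered history:

```python
def count_turns(history, speaker):
    """Count how many times `speaker` takes the floor in an ordered
    (speaker, timestamp) history; consecutive entries count as one turn."""
    turns, prev = 0, None
    for current, _ in history:
        if current == speaker and current != prev:
            turns += 1
        prev = current
    return turns

history = [('A', 0), ('A', 2), ('B', 5), ('A', 9), ('B', 12)]
print(count_turns(history, 'A'))  # 2 (the back-to-back entries at 0 and 2 are one turn)
print(count_turns(history, 'B'))  # 2
```

Keep in mind that the tracker's deque with maxlen=window_size only retains the most recent entries, so turn counts are over that sliding window, not the whole meeting.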
Advanced Speaker Features
Implement speaker verification to improve accuracy:
```python
import hashlib

import numpy as np

class SpeakerVerification:
    def __init__(self):
        self.voice_embeddings = {}

    def create_voice_embedding(self, audio_segment, speaker_id):
        """Create a unique voice fingerprint (simplified)."""
        # In production, use deep learning models such as x-vectors.
        # This simplified example uses basic audio statistics.
        audio_array = np.frombuffer(audio_segment, dtype=np.int16)
        # Extract basic voice features
        features = {
            'mean': np.mean(audio_array),
            'std': np.std(audio_array),
            'max': np.max(audio_array),
            'min': np.min(audio_array)
        }
        # Create a simple hash as the embedding
        feature_str = str(sorted(features.items()))
        embedding = hashlib.md5(feature_str.encode()).hexdigest()
        self.voice_embeddings[speaker_id] = embedding
        return embedding

    def verify_speaker(self, audio_segment, claimed_speaker_id):
        """Verify whether audio matches a known speaker."""
        if claimed_speaker_id not in self.voice_embeddings:
            return False
        current_embedding = self.create_voice_embedding(
            audio_segment,
            "temp"
        )
        stored_embedding = self.voice_embeddings[claimed_speaker_id]
        # Exact-hash comparison only matches identical audio;
        # use cosine similarity over real embeddings in production
        return current_embedding == stored_embedding
```
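The comments above point at the production approach: real-valued embedding vectors compared with cosine similarity rather than exact hashes. A minimal sketch, with hard-coded vectors standing in for learned x-vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(embedding, enrolled, threshold=0.8):
    """Accept the claimed speaker if similarity exceeds the threshold."""
    return cosine_similarity(embedding, enrolled) >= threshold

enrolled = np.array([0.9, 0.1, 0.3])
same_speaker = np.array([0.85, 0.15, 0.32])   # close in direction
other_speaker = np.array([0.1, 0.9, -0.4])    # very different
print(verify(same_speaker, enrolled))   # True
print(verify(other_speaker, enrolled))  # False
```

The threshold value here is illustrative; in practice it is tuned on labeled data to balance false accepts against false rejects.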
Building the Complete Multi-Speaker System
Integrate all components:
```python
class MultiSpeakerTranscriptionSystem:
    def __init__(self):
        self.diarization = SpeakerDiarizationService()
        self.mapper = SpeakerMapper()
        self.overlap_handler = OverlapHandler()
        self.tracker = RealtimeSpeakerTracker()

    def process_meeting(self, audio_file, participants, join_times):
        """Complete multi-speaker transcription pipeline."""
        print("Starting multi-speaker transcription...")
        # Step 1: Transcribe with speaker diarization
        transcript = self.diarization.transcribe_with_speakers(
            audio_file,
            expected_speakers=len(participants)
        )
        # Step 2: Extract speaker segments
        segments = self.diarization.extract_speaker_segments(transcript)
        print(f"Extracted {len(segments)} speaker segments")
        # Step 3: Map speakers to participants
        self.mapper.create_participant_list(participants)
        self.mapper.auto_map_speakers(segments, join_times)
        # Step 4: Apply mapping
        mapped_segments = self.mapper.apply_mapping(segments)
        # Step 5: Handle overlaps
        overlaps = self.overlap_handler.detect_overlaps(mapped_segments)
        final_segments = self.overlap_handler.merge_overlapping_segments(
            mapped_segments,
            overlaps
        )
        print(f"Detected {len(overlaps)} overlapping speech instances")
        return final_segments

    def format_transcript(self, segments):
        """Format a multi-speaker transcript for display."""
        output = []
        output.append("MULTI-SPEAKER MEETING TRANSCRIPT")
        output.append("=" * 70)
        output.append("")
        for segment in segments:
            timestamp = self._format_time(segment['start'])
            participant = segment.get('participant', segment['speaker'])
            text = segment['text']
            output.append(f"[{timestamp}] {participant}:")
            output.append(f"  {text}")
            output.append("")
        return "\n".join(output)

    def _format_time(self, seconds):
        """Format seconds as MM:SS."""
        minutes = int(seconds // 60)
        secs = int(seconds % 60)
        return f"{minutes:02d}:{secs:02d}"

    def generate_speaking_stats(self, segments):
        """Generate a speaking statistics report."""
        stats = {}
        for segment in segments:
            participant = segment.get('participant', segment['speaker'])
            duration = segment['end'] - segment['start']
            if participant not in stats:
                stats[participant] = {'duration': 0, 'turns': 0}
            stats[participant]['duration'] += duration
            stats[participant]['turns'] += 1
        # Calculate percentages
        total_duration = sum(s['duration'] for s in stats.values())
        for participant in stats:
            percentage = (stats[participant]['duration'] / total_duration * 100)
            stats[participant]['percentage'] = round(percentage, 2)
        return stats

    def save_transcript(self, segments, filename="transcript.txt"):
        """Save the formatted transcript with statistics."""
        formatted = self.format_transcript(segments)
        stats = self.generate_speaking_stats(segments)
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(formatted)
            f.write("\n" + "=" * 70 + "\n")
            f.write("SPEAKING STATISTICS\n")
            f.write("=" * 70 + "\n\n")
            for participant, data in sorted(
                stats.items(),
                key=lambda x: x[1]['duration'],
                reverse=True
            ):
                f.write(f"{participant}:\n")
                f.write(f"  Speaking time: {data['duration']:.1f}s ({data['percentage']}%)\n")
                f.write(f"  Speaking turns: {data['turns']}\n\n")
        print(f"Transcript saved: {filename}")

# Usage example
if __name__ == "__main__":
    system = MultiSpeakerTranscriptionSystem()
    # Define meeting participants
    participants = [
        {'id': '1', 'name': 'Alice Johnson', 'email': 'alice@company.com'},
        {'id': '2', 'name': 'Bob Smith', 'email': 'bob@company.com'},
        {'id': '3', 'name': 'Carol White', 'email': 'carol@company.com'}
    ]
    # Participant join times (seconds from meeting start)
    join_times = {
        '1': 0,
        '2': 15,
        '3': 30
    }
    # Process the meeting
    segments = system.process_meeting(
        "meeting_audio.wav",
        participants,
        join_times
    )
    # Save results
    system.save_transcript(segments, "meeting_transcript.txt")
```
Optimization Tips
Use speaker count hints to improve accuracy—telling the API how many speakers to expect reduces confusion. Pre-process audio to remove silence and background noise. Cache voice profiles for recurring participants to speed up identification in future meetings.
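The pre-processing step mentioned above can be as simple as energy-based silence trimming on raw PCM samples. A rough sketch, assuming 16-bit mono audio and an illustrative amplitude threshold:

```python
import numpy as np

def trim_silence(samples, threshold=500, frame_size=4):
    """Drop frames whose mean absolute amplitude falls below a threshold."""
    kept = []
    for i in range(0, len(samples), frame_size):
        frame = samples[i:i + frame_size]
        if np.mean(np.abs(frame)) >= threshold:
            kept.append(frame)
    return np.concatenate(kept) if kept else np.array([], dtype=samples.dtype)

# Quiet samples surround a loud burst of speech
audio = np.array([0, 10, -5, 3, 8000, -7500, 6000, -9000, 2, -1, 4, 0], dtype=np.int16)
print(trim_silence(audio).tolist())  # [8000, -7500, 6000, -9000]
```

Production pipelines would use a proper voice activity detector and much larger frame sizes (for example 10 to 30 ms of samples per frame), but the energy-gating idea is the same.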
Monitor confidence scores. Low confidence indicates challenging audio or speaker confusion. Flag these segments for manual review in critical applications.
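Flagging low-confidence segments is a one-line filter over the segment dicts produced earlier. A sketch, assuming the 0-to-1 confidence field carried through from the utterances:

```python
def flag_for_review(segments, min_confidence=0.7):
    """Return segments whose transcription confidence falls below the cutoff."""
    return [s for s in segments if s.get('confidence', 1.0) < min_confidence]

segments = [
    {'speaker': 'A', 'text': 'Clear audio here', 'confidence': 0.95},
    {'speaker': 'B', 'text': 'Mumbled crosstalk', 'confidence': 0.42},
]
print([s['speaker'] for s in flag_for_review(segments)])  # ['B']
```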
Your multi-speaker transcription system now accurately identifies speakers, handles overlapping speech, generates participation statistics, and produces professional meeting transcripts with clear speaker attribution.
Conclusion
Implementing robust multi-speaker transcription requires combining API-based diarization with intelligent speaker mapping, overlap handling, and real-time tracking to create accurate, actionable meeting transcripts.
If you want production-ready multi-speaker transcription without building complex systems, consider Meetstream.ai API, which automatically handles speaker identification and participant mapping across all major meeting platforms.