How to Handle Multi-Speaker Transcription in Meetings

Identifying who said what transforms raw transcripts into actionable meeting notes. Multi-speaker transcription combines speech recognition with speaker diarization: the audio is segmented by voice, each segment receives a speaker label, and the conversation flow is preserved. This guide demonstrates how to implement robust speaker identification for meeting bots using modern speech recognition APIs and custom processing techniques.

Understanding Speaker Diarization

Speaker diarization answers “who spoke when” without knowing speakers’ identities beforehand. The system analyzes voice characteristics—pitch, tone, speaking rate—to cluster audio segments by speaker. You then map these clusters to actual participants using metadata from the meeting platform.
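Concretely, diarization output is a list of time-stamped segments tagged with anonymous speaker labels. A minimal sketch of that data shape (the labels, times, and text here are illustrative, not real API output):

```python
# Illustrative diarization output: each segment records who spoke when.
diarized = [
    {"speaker": "A", "start": 0.0, "end": 4.2, "text": "Let's get started."},
    {"speaker": "B", "start": 4.5, "end": 9.1, "text": "Sure, I have the numbers."},
    {"speaker": "A", "start": 9.3, "end": 12.0, "text": "Great, go ahead."},
]

# "Who spoke when" in aggregate: total speaking time per anonymous label.
talk_time = {}
for seg in diarized:
    talk_time[seg["speaker"]] = (
        talk_time.get(seg["speaker"], 0.0) + seg["end"] - seg["start"]
    )

print(talk_time)
```

Everything downstream in this guide (mapping, overlap handling, statistics) operates on segment lists of exactly this shape.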

Choosing the Right Transcription Service

Modern APIs handle diarization automatically. AssemblyAI, Deepgram, and Azure Speech Services all support speaker separation:

import assemblyai as aai
import os
from dotenv import load_dotenv

load_dotenv()

class SpeakerDiarizationService:
    def __init__(self):
        aai.settings.api_key = os.getenv("ASSEMBLYAI_API_KEY")

    def transcribe_with_speakers(self, audio_file, expected_speakers=None):
        """Transcribe audio with speaker identification."""
        config = aai.TranscriptionConfig(
            speaker_labels=True,
            speakers_expected=expected_speakers,  # Optional hint
            language_code="en_us"
        )
        transcriber = aai.Transcriber()
        transcript = transcriber.transcribe(audio_file, config=config)

        if transcript.status == aai.TranscriptStatus.error:
            raise Exception(f"Transcription failed: {transcript.error}")

        return transcript

    def extract_speaker_segments(self, transcript):
        """Extract timestamped speaker segments."""
        segments = []
        for utterance in transcript.utterances:
            segment = {
                'speaker': utterance.speaker,
                'start': utterance.start / 1000,  # Milliseconds to seconds
                'end': utterance.end / 1000,
                'text': utterance.text,
                'confidence': utterance.confidence
            }
            segments.append(segment)
        return segments

Mapping Anonymous Speakers to Real Participants

APIs return generic labels like “Speaker A” and “Speaker B”. Map these to actual participants:

class SpeakerMapper:
    def __init__(self):
        self.speaker_map = {}
        self.voice_profiles = {}
        self.participants = {}

    def create_participant_list(self, meeting_participants):
        """Initialize mapping from meeting platform data."""
        self.participants = {
            p['id']: {
                'name': p['name'],
                'email': p.get('email'),
                'role': p.get('role', 'participant')
            }
            for p in meeting_participants
        }

    def map_speaker_to_participant(self, speaker_label, participant_id):
        """Manually map a speaker label to a participant."""
        self.speaker_map[speaker_label] = participant_id
        print(f"Mapped {speaker_label} -> {self.participants[participant_id]['name']}")

    def auto_map_speakers(self, segments, participant_join_times):
        """Heuristically map speakers based on join times."""
        # Group segments by speaker
        speaker_segments = {}
        for segment in segments:
            speaker_segments.setdefault(segment['speaker'], []).append(segment)

        # Match each speaker to the participant who joined closest
        # before that speaker's first utterance
        for speaker, segs in speaker_segments.items():
            first_speech_time = min(s['start'] for s in segs)
            best_match = None
            min_diff = float('inf')
            for participant_id, join_time in participant_join_times.items():
                if join_time <= first_speech_time:
                    diff = first_speech_time - join_time
                    if diff < min_diff:
                        min_diff = diff
                        best_match = participant_id
            if best_match is not None:
                self.map_speaker_to_participant(speaker, best_match)

    def get_participant_name(self, speaker_label):
        """Get a participant name from a speaker label."""
        if speaker_label in self.speaker_map:
            participant_id = self.speaker_map[speaker_label]
            return self.participants[participant_id]['name']
        return f"Unknown ({speaker_label})"

    def apply_mapping(self, segments):
        """Apply the speaker mapping to transcript segments."""
        mapped_segments = []
        for segment in segments:
            mapped_segment = segment.copy()
            mapped_segment['participant'] = self.get_participant_name(
                segment['speaker']
            )
            mapped_segments.append(mapped_segment)
        return mapped_segments

Handling Overlapping Speech

Multiple people talking simultaneously creates transcription challenges. Implement overlap detection:

import numpy as np

class OverlapHandler:
    def __init__(self, overlap_threshold=0.3):
        self.overlap_threshold = overlap_threshold

    def detect_overlaps(self, segments):
        """Identify overlapping speech segments."""
        overlaps = []
        for i, seg1 in enumerate(segments):
            for seg2 in segments[i + 1:]:
                overlap_start = max(seg1['start'], seg2['start'])
                overlap_end = min(seg1['end'], seg2['end'])
                if overlap_end > overlap_start:
                    overlap_duration = overlap_end - overlap_start
                    seg1_duration = seg1['end'] - seg1['start']
                    # Flag only overlaps covering a meaningful share of the segment
                    overlap_pct = overlap_duration / seg1_duration
                    if overlap_pct >= self.overlap_threshold:
                        overlaps.append({
                            'speakers': [seg1['speaker'], seg2['speaker']],
                            'start': overlap_start,
                            'end': overlap_end,
                            'duration': overlap_duration
                        })
        return overlaps

    def merge_overlapping_segments(self, segments, overlaps):
        """Mark segments that fall inside detected overlaps."""
        processed_segments = []
        for segment in segments:
            # A segment counts as overlapping only if both its speaker and
            # its time range match a detected overlap
            has_overlap = any(
                segment['speaker'] in overlap['speakers']
                and segment['start'] < overlap['end']
                and segment['end'] > overlap['start']
                for overlap in overlaps
            )
            if has_overlap:
                segment['overlap'] = True
                segment['text'] = f"[Overlap] {segment['text']}"
            processed_segments.append(segment)
        return processed_segments

Real-Time Speaker Tracking

For live transcription, track speakers in real-time:

from collections import deque
import threading

class RealtimeSpeakerTracker:
    def __init__(self, window_size=5):
        self.current_speaker = None
        self.last_speaker_start = None
        self.speaker_history = deque(maxlen=window_size)
        self.speaker_durations = {}
        self.lock = threading.Lock()

    def update_speaker(self, speaker_label, timestamp):
        """Update the current speaker and track speaking time."""
        with self.lock:
            if self.current_speaker != speaker_label:
                # Speaker changed: close out the previous speaker's turn
                if self.current_speaker is not None:
                    duration = timestamp - self.last_speaker_start
                    self.speaker_durations[self.current_speaker] = (
                        self.speaker_durations.get(self.current_speaker, 0) + duration
                    )
                self.current_speaker = speaker_label
                self.last_speaker_start = timestamp
                self.speaker_history.append((speaker_label, timestamp))

    def get_speaking_stats(self):
        """Calculate speaking time statistics."""
        total_time = sum(self.speaker_durations.values())
        stats = {}
        for speaker, duration in self.speaker_durations.items():
            percentage = (duration / total_time * 100) if total_time > 0 else 0
            stats[speaker] = {
                'duration': duration,
                'percentage': round(percentage, 2),
                'turns': self._count_speaking_turns(speaker)
            }
        return stats

    def _count_speaking_turns(self, speaker):
        """Count the number of times a speaker took the floor."""
        turns = 0
        prev_speaker = None
        for hist_speaker, _ in self.speaker_history:
            if hist_speaker != prev_speaker and hist_speaker == speaker:
                turns += 1
            prev_speaker = hist_speaker
        return turns

    def get_dominant_speaker(self):
        """Identify who spoke the most."""
        if not self.speaker_durations:
            return None
        return max(self.speaker_durations.items(), key=lambda x: x[1])[0]

Advanced Speaker Features

Implement speaker verification to improve accuracy:

class SpeakerVerification:
    def __init__(self, similarity_threshold=0.95):
        self.voice_embeddings = {}
        self.similarity_threshold = similarity_threshold

    def create_voice_embedding(self, audio_segment, speaker_id=None):
        """Create a crude voice fingerprint (simplified).

        In production, use deep learning speaker embeddings such as
        x-vectors; these summary statistics are only a stand-in to
        illustrate the workflow.
        """
        audio_array = np.frombuffer(audio_segment, dtype=np.int16).astype(np.float64)
        # Extract basic amplitude features as a stand-in embedding vector
        embedding = np.array([
            np.mean(audio_array),
            np.std(audio_array),
            np.max(audio_array),
            np.min(audio_array)
        ])
        if speaker_id is not None:
            self.voice_embeddings[speaker_id] = embedding
        return embedding

    def verify_speaker(self, audio_segment, claimed_speaker_id):
        """Verify whether audio plausibly matches a known speaker."""
        if claimed_speaker_id not in self.voice_embeddings:
            return False
        current = self.create_voice_embedding(audio_segment)
        stored = self.voice_embeddings[claimed_speaker_id]
        # Compare embeddings with cosine similarity rather than exact
        # equality, which would almost never match on real audio
        denom = np.linalg.norm(current) * np.linalg.norm(stored)
        if denom == 0:
            return False
        similarity = float(np.dot(current, stored)) / denom
        return similarity >= self.similarity_threshold

Building the Complete Multi-Speaker System

Integrate all components:

class MultiSpeakerTranscriptionSystem:
    def __init__(self):
        self.diarization = SpeakerDiarizationService()
        self.mapper = SpeakerMapper()
        self.overlap_handler = OverlapHandler()
        self.tracker = RealtimeSpeakerTracker()

    def process_meeting(self, audio_file, participants, join_times):
        """Complete multi-speaker transcription pipeline."""
        print("Starting multi-speaker transcription...")

        # Step 1: Transcribe with speaker diarization
        transcript = self.diarization.transcribe_with_speakers(
            audio_file,
            expected_speakers=len(participants)
        )

        # Step 2: Extract speaker segments
        segments = self.diarization.extract_speaker_segments(transcript)
        print(f"Extracted {len(segments)} speaker segments")

        # Step 3: Map speakers to participants
        self.mapper.create_participant_list(participants)
        self.mapper.auto_map_speakers(segments, join_times)

        # Step 4: Apply mapping
        mapped_segments = self.mapper.apply_mapping(segments)

        # Step 5: Handle overlaps
        overlaps = self.overlap_handler.detect_overlaps(mapped_segments)
        final_segments = self.overlap_handler.merge_overlapping_segments(
            mapped_segments,
            overlaps
        )
        print(f"Detected {len(overlaps)} overlapping speech instances")

        return final_segments

    def format_transcript(self, segments):
        """Format a multi-speaker transcript for display."""
        output = []
        output.append("MULTI-SPEAKER MEETING TRANSCRIPT")
        output.append("=" * 70)
        output.append("")
        for segment in segments:
            timestamp = self._format_time(segment['start'])
            participant = segment.get('participant', segment['speaker'])
            output.append(f"[{timestamp}] {participant}:")
            output.append(f"  {segment['text']}")
            output.append("")
        return "\n".join(output)

    def _format_time(self, seconds):
        """Format seconds as MM:SS."""
        minutes = int(seconds // 60)
        secs = int(seconds % 60)
        return f"{minutes:02d}:{secs:02d}"

    def generate_speaking_stats(self, segments):
        """Generate a speaking statistics report."""
        stats = {}
        for segment in segments:
            participant = segment.get('participant', segment['speaker'])
            duration = segment['end'] - segment['start']
            if participant not in stats:
                stats[participant] = {'duration': 0, 'turns': 0}
            stats[participant]['duration'] += duration
            stats[participant]['turns'] += 1

        # Calculate percentages (guard against an empty transcript)
        total_duration = sum(s['duration'] for s in stats.values())
        for participant in stats:
            percentage = (
                stats[participant]['duration'] / total_duration * 100
                if total_duration > 0 else 0
            )
            stats[participant]['percentage'] = round(percentage, 2)
        return stats

    def save_transcript(self, segments, filename="transcript.txt"):
        """Save the formatted transcript with statistics."""
        formatted = self.format_transcript(segments)
        stats = self.generate_speaking_stats(segments)

        with open(filename, 'w', encoding='utf-8') as f:
            f.write(formatted)
            f.write("\n" + "=" * 70 + "\n")
            f.write("SPEAKING STATISTICS\n")
            f.write("=" * 70 + "\n\n")
            for participant, data in sorted(
                stats.items(),
                key=lambda x: x[1]['duration'],
                reverse=True
            ):
                f.write(f"{participant}:\n")
                f.write(f"  Speaking time: {data['duration']:.1f}s ({data['percentage']}%)\n")
                f.write(f"  Speaking turns: {data['turns']}\n\n")

        print(f"Transcript saved: {filename}")

# Usage example
if __name__ == "__main__":
    system = MultiSpeakerTranscriptionSystem()

    # Define meeting participants
    participants = [
        {'id': '1', 'name': 'Alice Johnson', 'email': 'alice@company.com'},
        {'id': '2', 'name': 'Bob Smith', 'email': 'bob@company.com'},
        {'id': '3', 'name': 'Carol White', 'email': 'carol@company.com'}
    ]

    # Participant join times (seconds from meeting start)
    join_times = {
        '1': 0,
        '2': 15,
        '3': 30
    }

    # Process meeting
    segments = system.process_meeting(
        "meeting_audio.wav",
        participants,
        join_times
    )

    # Save results
    system.save_transcript(segments, "meeting_transcript.txt")

Optimization Tips

Use speaker count hints to improve accuracy—telling the API how many speakers to expect reduces confusion. Pre-process audio to remove silence and background noise. Cache voice profiles for recurring participants to speed up identification in future meetings.
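Silence removal can be as simple as an energy gate over fixed-size frames. A minimal sketch (the frame size and threshold below are illustrative values for 16 kHz, 16-bit PCM, not tuned recommendations):

```python
import numpy as np

def trim_silence(samples, frame_size=1600, threshold=500.0):
    """Drop frames whose mean absolute amplitude falls below a threshold.

    `samples` is 16-bit PCM as a numpy int16 array. A frame_size of 1600
    is 100 ms at 16 kHz; the threshold is an illustrative cutoff that
    should be tuned per audio source.
    """
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    voiced = [f for f in frames if np.mean(np.abs(f.astype(np.float32))) >= threshold]
    if not voiced:
        return samples[:0]  # Empty array: everything was silence
    return np.concatenate(voiced)
```

A production system would use a real voice activity detector instead, but even a crude gate like this shortens the audio the API must process.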

Monitor confidence scores. Low confidence indicates challenging audio or speaker confusion. Flag these segments for manual review in critical applications.
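Flagging low-confidence segments is a one-pass filter over the segment dicts produced earlier; a minimal sketch (the 0.7 cutoff is an illustrative choice, not a recommended value):

```python
def flag_low_confidence(segments, cutoff=0.7):
    """Return segments needing manual review, worst confidence first."""
    flagged = [s for s in segments if s.get("confidence", 1.0) < cutoff]
    return sorted(flagged, key=lambda s: s["confidence"])
```

Reviewing the worst segments first concentrates human effort where the transcript is most likely wrong.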

Your multi-speaker transcription system now identifies speakers, handles overlapping speech, generates participation statistics, and produces professional meeting transcripts with clear speaker attribution.

Conclusion

Implementing robust multi-speaker transcription requires combining API-based diarization with intelligent speaker mapping, overlap handling, and real-time tracking to create accurate, actionable meeting transcripts.

If you want production-ready multi-speaker transcription without building complex systems, consider Meetstream.ai API, which automatically handles speaker identification and participant mapping across all major meeting platforms.
