How to Build a Real-Time Transcription Bot vs Post-Meeting Transcription

Choosing between real-time and post-meeting transcription fundamentally shapes your architecture. 

Real-time systems stream audio chunks for instant captions, requiring low-latency pipelines and websocket connections. 

Post-meeting systems process complete recordings, allowing batch optimization and higher accuracy. 

This guide demonstrates both approaches, helping you choose the right architecture for your use case.

Understanding the Trade-offs

Real-time transcription delivers instant feedback but sacrifices accuracy for speed. 

Post-meeting processing achieves higher accuracy through context analysis and multiple passes but delays results. 

Real-time systems typically cost more, because streaming APIs charge per second of audio. Post-meeting systems batch-process efficiently but can’t provide live captions.

Real-Time Transcription Architecture

Build a streaming transcription bot using websockets:

import asyncio
import json

from deepgram import Deepgram  # Deepgram Python SDK, v2-style client


class RealtimeTranscriptionBot:
    def __init__(self, api_key):
        self.dg_client = Deepgram(api_key)
        self.socket = None
        self.transcript_buffer = []

    async def start_stream(self, audio_stream):
        """Start real-time transcription stream"""
        # Configure streaming options
        options = {
            'punctuate': True,
            'interim_results': True,
            'language': 'en-US',
            'model': 'nova-2',
            'smart_format': True,
            'diarize': True,
            'encoding': 'linear16',
            'sample_rate': 16000
        }
        # Create streaming connection
        self.socket = await self.dg_client.transcription.live(options)
        # Register event handlers
        self.socket.registerHandler(
            self.socket.event.TRANSCRIPT_RECEIVED,
            self._on_transcript
        )
        self.socket.registerHandler(
            self.socket.event.CLOSE,
            self._on_close
        )
        # Start audio streaming
        await self._stream_audio(audio_stream)

    async def _stream_audio(self, audio_stream):
        """Stream audio chunks to transcription service"""
        try:
            while True:
                # Get audio chunk (typically 100-200ms)
                # 3200 bytes = 100 ms of 16 kHz, 16-bit mono audio
                chunk = await audio_stream.read(3200)
                if not chunk:
                    break
                # Send to transcription service
                self.socket.send(chunk)
                # Small delay to prevent overwhelming the API
                await asyncio.sleep(0.01)
        except Exception as e:
            print(f"Streaming error: {e}")
        finally:
            # Signal end of stream (finish() is a coroutine in the v2 SDK)
            await self.socket.finish()

    def _on_transcript(self, result):
        """Handle incoming transcript results"""
        # Accept either raw JSON text or an already-parsed dict
        if isinstance(result, (str, bytes)):
            transcript_data = json.loads(result)
        else:
            transcript_data = result
        # Check if this is a final result
        is_final = transcript_data.get('is_final', False)
        channel = transcript_data.get('channel', {})
        alternatives = channel.get('alternatives', [])
        if alternatives:
            transcript = alternatives[0].get('transcript', '')
            confidence = alternatives[0].get('confidence', 0)
            if is_final and transcript.strip():
                # Final result: save to buffer
                words = alternatives[0].get('words', [])
                speaker = words[0].get('speaker', 0) if words else 0
                entry = {
                    'speaker': speaker,
                    'text': transcript,
                    'confidence': confidence,
                    'timestamp': words[0].get('start', 0) if words else 0,
                    'is_final': True
                }
                self.transcript_buffer.append(entry)
                print(f"[FINAL] Speaker {speaker}: {transcript}")
            elif transcript.strip():
                # Interim result: display but don't save
                print(f"[INTERIM] {transcript}", end='\r')

    def _on_close(self, _):
        """Handle connection close"""
        print("\nTranscription stream closed")

    def get_transcript(self):
        """Get accumulated transcript"""
        return self.transcript_buffer

    def save_transcript(self, filename):
        """Save real-time transcript to file"""
        with open(filename, 'w', encoding='utf-8') as f:
            f.write("Real-Time Transcript\n")
            f.write("=" * 60 + "\n\n")
            for entry in self.transcript_buffer:
                timestamp = self._format_time(entry['timestamp'])
                speaker = f"Speaker {entry['speaker']}"
                text = entry['text']
                confidence = entry['confidence']
                f.write(f"[{timestamp}] {speaker} (conf: {confidence:.2f}):\n")
                f.write(f"{text}\n\n")

    def _format_time(self, seconds):
        """Format seconds to MM:SS"""
        minutes = int(seconds // 60)
        secs = int(seconds % 60)
        return f"{minutes:02d}:{secs:02d}"
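
The read size passed to the audio stream determines how much audio each chunk carries. As a rough sketch of that arithmetic for raw PCM audio (the helper names `chunk_size_bytes` and `chunk_duration_ms` are illustrative, not part of any SDK), assuming 16-bit mono samples as in the streaming options above:

```python
def chunk_size_bytes(sample_rate_hz: int, chunk_ms: int,
                     sample_width_bytes: int = 2, channels: int = 1) -> int:
    """Bytes of raw PCM audio needed to cover chunk_ms milliseconds."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * sample_width_bytes * channels


def chunk_duration_ms(num_bytes: int, sample_rate_hz: int,
                      sample_width_bytes: int = 2, channels: int = 1) -> float:
    """Milliseconds of audio contained in num_bytes of raw PCM."""
    bytes_per_second = sample_rate_hz * sample_width_bytes * channels
    return 1000 * num_bytes / bytes_per_second


print(chunk_size_bytes(16000, 100))    # 3200
print(chunk_size_bytes(16000, 200))    # 6400
print(chunk_duration_ms(3200, 16000))  # 100.0
```

At 16 kHz, 16-bit mono, a 3200-byte read therefore carries 100 ms of audio; a 200 ms chunk would need 6400 bytes.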

Post-Meeting Transcription Architecture

Build a batch processing system for recorded audio:

import assemblyai as aai


class PostMeetingTranscriber:
    def __init__(self, api_key):
        aai.settings.api_key = api_key

    def transcribe_recording(self, audio_file):
        """Transcribe complete recording with maximum accuracy"""
        # Configure for maximum accuracy
        config = aai.TranscriptionConfig(
            # speaker_labels turns on diarization; there is no separate diarize flag
            speaker_labels=True,
            speakers_expected=None,  # Auto-detect
            punctuate=True,
            format_text=True,
            # Enhanced features for post-processing
            auto_highlights=True,
            content_safety=True,
            iab_categories=True,
            sentiment_analysis=True,
            entity_detection=True,
            # Language settings: either fix the language here, or drop
            # language_code and set language_detection=True (use one, not both)
            language_code="en_us",
            # Accuracy boosting: boost_param="high" only takes effect alongside
            # a word_boost list of domain terms, so add both if you have one
        )
        print("Starting transcription (this may take several minutes)...")
        transcriber = aai.Transcriber()
        transcript = transcriber.transcribe(audio_file, config=config)
        if transcript.status == aai.TranscriptStatus.error:
            raise RuntimeError(f"Transcription failed: {transcript.error}")
        print("Transcription complete!")
        return transcript

    def generate_comprehensive_output(self, transcript):
        """Generate detailed output with all insights"""
        output = {
            'transcript': self._format_transcript(transcript),
            'summary': self._extract_summary(transcript),
            'highlights': self._extract_highlights(transcript),
            'action_items': self._extract_action_items(transcript),
            'sentiment': self._analyze_sentiment(transcript),
            'topics': self._extract_topics(transcript),
            'speakers': self._analyze_speakers(transcript)
        }
        return output

    def _format_transcript(self, transcript):
        """Format transcript with speaker labels"""
        formatted = []
        for utterance in transcript.utterances:
            # Utterance times are in milliseconds
            timestamp = self._format_time(utterance.start / 1000)
            speaker = f"Speaker {utterance.speaker}"
            text = utterance.text
            formatted.append({
                'timestamp': timestamp,
                'speaker': speaker,
                'text': text
            })
        return formatted

    def _extract_summary(self, transcript):
        """Extract meeting summary"""
        if hasattr(transcript, 'summary') and transcript.summary:
            return transcript.summary
        # Fallback: Create basic summary from highlights
        if hasattr(transcript, 'auto_highlights'):
            highlights = transcript.auto_highlights
            if highlights and highlights.results:
                summary_points = [h.text for h in highlights.results[:5]]
                return ' '.join(summary_points)
        return "No summary available"

    def _extract_highlights(self, transcript):
        """Extract key highlights"""
        highlights = []
        if hasattr(transcript, 'auto_highlights') and transcript.auto_highlights:
            for highlight in transcript.auto_highlights.results:
                highlights.append({
                    'text': highlight.text,
                    'count': highlight.count,
                    'rank': highlight.rank,
                    'timestamps': highlight.timestamps
                })
        return highlights

    def _extract_action_items(self, transcript):
        """Extract action items and follow-ups"""
        action_keywords = [
            'will', 'should', 'need to', 'have to', 'must',
            'action item', 'todo', 'follow up', 'next step'
        ]
        action_items = []
        for utterance in transcript.utterances:
            text_lower = utterance.text.lower()
            if any(keyword in text_lower for keyword in action_keywords):
                action_items.append({
                    'speaker': f"Speaker {utterance.speaker}",
                    'text': utterance.text,
                    'timestamp': utterance.start / 1000
                })
        return action_items

    def _analyze_sentiment(self, transcript):
        """Analyze sentiment throughout meeting"""
        # Assumption: the Python SDK exposes these results as `sentiment_analysis`;
        # fall back to the raw API field name used in the JSON response
        results = (getattr(transcript, 'sentiment_analysis', None)
                   or getattr(transcript, 'sentiment_analysis_results', None))
        if not results:
            return []
        sentiments = []
        for result in results:
            sentiments.append({
                'text': result.text,
                'sentiment': result.sentiment,
                'confidence': result.confidence,
                'speaker': getattr(result, 'speaker', None)
            })
        return sentiments

    def _extract_topics(self, transcript):
        """Extract main topics discussed"""
        # Assumption: the Python SDK exposes these results as `iab_categories`;
        # fall back to the raw API field name used in the JSON response
        iab = (getattr(transcript, 'iab_categories', None)
               or getattr(transcript, 'iab_categories_result', None))
        if not iab:
            return []
        topics = []
        for result in iab.results:
            for label in result.labels:
                topics.append({
                    'topic': label.label,
                    'relevance': label.relevance
                })
        return sorted(topics, key=lambda x: x['relevance'], reverse=True)[:10]

    def _analyze_speakers(self, transcript):
        """Analyze speaker participation"""
        speaker_stats = {}
        for utterance in transcript.utterances:
            speaker = utterance.speaker
            duration = (utterance.end - utterance.start) / 1000
            if speaker not in speaker_stats:
                speaker_stats[speaker] = {
                    'duration': 0,
                    'turns': 0,
                    'word_count': 0
                }
            speaker_stats[speaker]['duration'] += duration
            speaker_stats[speaker]['turns'] += 1
            speaker_stats[speaker]['word_count'] += len(utterance.text.split())
        return speaker_stats

    def _format_time(self, seconds):
        """Format seconds to HH:MM:SS"""
        hours = int(seconds // 3600)
        minutes = int((seconds % 3600) // 60)
        secs = int(seconds % 60)
        return f"{hours:02d}:{minutes:02d}:{secs:02d}"

    def save_comprehensive_output(self, output, base_filename):
        """Save all outputs to files"""
        # Save main transcript
        with open(f"{base_filename}_transcript.txt", 'w', encoding='utf-8') as f:
            f.write("MEETING TRANSCRIPT\n")
            f.write("=" * 70 + "\n\n")
            for entry in output['transcript']:
                f.write(f"[{entry['timestamp']}] {entry['speaker']}:\n")
                f.write(f"{entry['text']}\n\n")
        # Save summary and insights
        with open(f"{base_filename}_insights.txt", 'w', encoding='utf-8') as f:
            f.write("MEETING INSIGHTS\n")
            f.write("=" * 70 + "\n\n")
            f.write("SUMMARY:\n")
            f.write(f"{output['summary']}\n\n")
            f.write("KEY HIGHLIGHTS:\n")
            for highlight in output['highlights'][:5]:
                f.write(f"- {highlight['text']}\n")
            f.write("\n")
            f.write("ACTION ITEMS:\n")
            for action in output['action_items']:
                f.write(f"- [{action['speaker']}] {action['text']}\n")
            f.write("\n")
            f.write("MAIN TOPICS:\n")
            for topic in output['topics'][:5]:
                f.write(f"- {topic['topic']} (relevance: {topic['relevance']:.2f})\n")
            f.write("\n")
            f.write("SPEAKER STATISTICS:\n")
            for speaker, stats in output['speakers'].items():
                f.write(f"Speaker {speaker}:\n")
                f.write(f"  Duration: {stats['duration']:.1f}s\n")
                f.write(f"  Turns: {stats['turns']}\n")
                f.write(f"  Words: {stats['word_count']}\n")
        print(f"Saved outputs to {base_filename}_*.txt")
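
Plain substring matching on action keywords also fires inside longer words: "will" matches "willing", and "must" matches "mustard". A possible refinement, sketched here with the same keyword list and an illustrative helper name (`looks_like_action_item`), is to match whole words and phrases with a regular expression:

```python
import re

ACTION_PHRASES = [
    "will", "should", "need to", "have to", "must",
    "action item", "todo", "follow up", "next step",
]

# \b anchors each phrase at word boundaries so "will" does not match "willing"
_ACTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(p) for p in ACTION_PHRASES) + r")\b",
    re.IGNORECASE,
)


def looks_like_action_item(text: str) -> bool:
    """Return True if the text contains an action phrase as a whole word."""
    return _ACTION_RE.search(text) is not None


print(looks_like_action_item("I will send the report tomorrow"))  # True
print(looks_like_action_item("She is willing to help"))           # False
```

Keyword matching of either kind remains a heuristic; it flags candidate sentences for review rather than identifying action items reliably.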

Hybrid Approach: Best of Both Worlds

Combine real-time and post-meeting processing:

import wave


class _RecordingStream:
    """Pass-through wrapper that keeps a copy of every chunk read from a stream"""

    def __init__(self, stream, sink):
        self._stream = stream
        self._sink = sink

    async def read(self, size):
        chunk = await self._stream.read(size)
        if chunk:
            self._sink.append(chunk)
        return chunk


class HybridTranscriptionSystem:
    def __init__(self, realtime_api_key, batch_api_key):
        self.realtime = RealtimeTranscriptionBot(realtime_api_key)
        self.batch = PostMeetingTranscriber(batch_api_key)
        self.audio_recorder = []

    async def process_meeting(self, audio_stream):
        """Process with both real-time and post-meeting"""
        # Two independent readers on one stream would each receive only part of
        # the audio, so record every chunk as the real-time bot reads it
        print("Starting real-time transcription and recording audio...")
        recorded_stream = _RecordingStream(audio_stream, self.audio_recorder)
        # Wait for meeting to end
        await self.realtime.start_stream(recorded_stream)
        print("\nMeeting ended. Processing recording...")
        # Save recording
        recording_file = "meeting_recording.wav"
        self._save_recording(recording_file)
        # Post-process for high accuracy
        transcript = self.batch.transcribe_recording(recording_file)
        output = self.batch.generate_comprehensive_output(transcript)
        return {
            'realtime': self.realtime.get_transcript(),
            'final': output
        }

    def _save_recording(self, filename):
        """Save recorded audio to file"""
        with wave.open(filename, 'wb') as wf:
            wf.setnchannels(1)      # mono
            wf.setsampwidth(2)      # 16-bit samples
            wf.setframerate(16000)  # 16 kHz
            wf.writeframes(b''.join(self.audio_recorder))
        print(f"Recording saved: {filename}")
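
When live captions and a recorder both need the same audio, each chunk has to be read once and delivered to both; two independent read loops on one stream would each receive only some of the chunks. One possible pattern is a fan-out over asyncio queues. This is a minimal sketch: `fan_out` and `drain` are illustrative names, and an `asyncio.StreamReader` filled with silence stands in for the meeting audio.

```python
import asyncio


async def fan_out(read_chunk, queues, chunk_size=3200):
    """Read each chunk once and hand it to every consumer queue."""
    while True:
        chunk = await read_chunk(chunk_size)
        for q in queues:
            await q.put(chunk)  # an empty chunk doubles as the end-of-stream marker
        if not chunk:
            break


async def drain(queue, sink):
    """Consume chunks from a queue until the end-of-stream marker arrives."""
    while True:
        chunk = await queue.get()
        if not chunk:
            break
        sink.append(chunk)


async def demo():
    reader = asyncio.StreamReader()
    reader.feed_data(b"\x00" * 8000)  # stand-in for meeting audio
    reader.feed_eof()
    live_q, rec_q = asyncio.Queue(), asyncio.Queue()
    live, recorded = [], []
    await asyncio.gather(
        fan_out(reader.read, [live_q, rec_q]),
        drain(live_q, live),      # would feed the streaming API
        drain(rec_q, recorded),   # would feed the WAV writer
    )
    return live, recorded


live, recorded = asyncio.run(demo())
print(len(b"".join(live)), len(b"".join(recorded)))
```

Both consumers end up with the complete audio, and a slow consumer only delays its own queue rather than dropping chunks for the other.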

Performance Comparison

Measure the differences between approaches:

import time


class TranscriptionBenchmark:
    def __init__(self):
        self.metrics = {}

    def benchmark_realtime(self, audio_stream):
        """Benchmark real-time transcription"""
        start_time = time.time()
        # Simulate real-time processing
        latencies = []
        for i in range(100):  # 100 chunks
            chunk_start = time.time()
            # Process chunk (simulated; replace with a real send/receive round trip)
            time.sleep(0.2)  # 200ms chunks
            chunk_latency = time.time() - chunk_start
            latencies.append(chunk_latency)
        total_time = time.time() - start_time
        self.metrics['realtime'] = {
            'total_time': total_time,
            'avg_latency': sum(latencies) / len(latencies),
            'max_latency': max(latencies),
            'min_latency': min(latencies)
        }
        return self.metrics['realtime']

    def benchmark_batch(self, audio_file, audio_duration=None):
        """Benchmark batch transcription"""
        start_time = time.time()
        # Process entire file
        # (actual transcription would go here)
        total_time = time.time() - start_time
        self.metrics['batch'] = {
            'total_time': total_time,
            # Processing time per second of audio; requires the audio
            # duration in seconds (audio_duration is an added parameter)
            'throughput': total_time / audio_duration if audio_duration else None
        }
        return self.metrics['batch']

    def print_comparison(self):
        """Print performance comparison"""
        print("\nPERFORMANCE COMPARISON")
        print("=" * 60)
        print("\nReal-Time:")
        print(f"  Average latency: {self.metrics['realtime']['avg_latency']*1000:.2f}ms")
        print(f"  Max latency: {self.metrics['realtime']['max_latency']*1000:.2f}ms")
        print("\nBatch Processing:")
        print(f"  Total time: {self.metrics['batch']['total_time']:.2f}s")
        print("\n" + "=" * 60)
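
Timing covers only half of the comparison; the accuracy difference between the two approaches can be measured with word error rate (WER) against a human-corrected reference transcript. The function below is a minimal sketch of the standard word-level edit-distance calculation, not part of either SDK, and it assumes simple whitespace tokenization with no punctuation normalization:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping one row at a time
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the quick brown fox", "the quick brown fox"))  # 0.0
print(word_error_rate("the quick brown fox", "the quick fox"))        # 0.25
```

Running it on the real-time transcript and on the post-meeting transcript of the same recording gives two directly comparable numbers.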

Decision Framework

Choose real-time when you need:

  • Live captions during meetings
  • Instant feedback for accessibility
  • Interactive features (commands, questions)
  • Real-time moderation or translation

Choose post-meeting when you need:

  • Maximum accuracy for records
  • Detailed insights and summaries
  • Cost optimization (batch cheaper)
  • Non-urgent documentation

Use hybrid when you need both live captions and accurate records.
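
The criteria above reduce to two questions: does anyone need output while the meeting is running, and does anyone need an accurate record afterwards? A small sketch of that decision, using an illustrative function name:

```python
def choose_architecture(needs_live_output: bool, needs_accurate_record: bool) -> str:
    """Map the two requirements onto one of the three architectures."""
    if needs_live_output and needs_accurate_record:
        return "hybrid"
    if needs_live_output:
        return "real-time"
    # With no live requirement, batch processing is the simpler choice
    return "post-meeting"


print(choose_architecture(True, False))   # real-time
print(choose_architecture(False, True))   # post-meeting
print(choose_architecture(True, True))    # hybrid
```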

Usage Examples

# Real-time only
# audio_stream is any object with an async read(n) method that returns bytes
async def realtime_demo(audio_stream):
    bot = RealtimeTranscriptionBot(api_key="your_key")
    await bot.start_stream(audio_stream)
    bot.save_transcript("realtime_transcript.txt")

# Post-meeting only
def batch_demo():
    transcriber = PostMeetingTranscriber(api_key="your_key")
    transcript = transcriber.transcribe_recording("meeting.wav")
    output = transcriber.generate_comprehensive_output(transcript)
    transcriber.save_comprehensive_output(output, "meeting")

# Hybrid approach
async def hybrid_demo(audio_stream):
    system = HybridTranscriptionSystem(
        realtime_api_key="key1",
        batch_api_key="key2"
    )
    results = await system.process_meeting(audio_stream)
    # Get both real-time captions and final accurate transcript
    return results
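
The demos rely on an audio source that supports `await audio_stream.read(n)`. For local testing without a live meeting, one option is an in-memory stand-in. The class below is a hypothetical helper, not part of any SDK; it serves raw PCM bytes and can optionally pace reads at the audio's real-time rate:

```python
import asyncio


class PCMBytesStream:
    """Serve raw PCM bytes through an async read(), like a live audio source."""

    def __init__(self, pcm: bytes, bytes_per_second: int = 32000, realtime: bool = False):
        self._pcm = pcm
        self._pos = 0
        self._bytes_per_second = bytes_per_second  # 16 kHz * 2 bytes * 1 channel
        self._realtime = realtime

    async def read(self, n: int) -> bytes:
        chunk = self._pcm[self._pos:self._pos + n]
        self._pos += len(chunk)
        if chunk and self._realtime:
            # Pace delivery so chunks arrive as fast as they would from a microphone
            await asyncio.sleep(len(chunk) / self._bytes_per_second)
        return chunk  # empty bytes signals end of stream


async def read_all(stream, chunk_size=3200):
    """Read until the stream is exhausted and return the chunk sizes."""
    sizes = []
    while True:
        chunk = await stream.read(chunk_size)
        if not chunk:
            break
        sizes.append(len(chunk))
    return sizes


sizes = asyncio.run(read_all(PCMBytesStream(b"\x00" * 8000)))
print(sizes)  # [3200, 3200, 1600]
```

An instance of this class can be passed wherever the examples expect `audio_stream`, for instance with the frames read from a 16 kHz mono WAV file.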

Real-time transcription delivers results with roughly 200-500ms latency but can cost 2-3x more.

Post-meeting processing can reach 95%+ accuracy with comprehensive insights but requires waiting. Choose based on your use case priorities: immediacy versus accuracy.

Conclusion

Real-time transcription excels at providing instant captions with acceptable accuracy, while post-meeting processing delivers superior accuracy with comprehensive insights and analytics. Choose based on whether you prioritize immediacy or precision.

If you want both capabilities without building complex systems, consider Meetstream.ai API, which provides optimized real-time streaming and high-accuracy batch processing with a single integration.
