How to Build an Audio Transcription Bot for Google Meet

Google Meet dominates enterprise video conferencing, yet most organizations struggle to capture and document meeting discussions effectively. An automated transcription bot solves this challenge by joining meetings programmatically, recording conversations, and generating accurate transcripts. This guide shows you how to create a Google Meet transcription bot using Python and modern cloud APIs.

Understanding Google Meet Bot Architecture

Unlike Zoom, Google Meet offers no dedicated bot SDK, so it requires a different approach. Your bot will use a browser-automation tool (such as Playwright, Puppeteer, or Selenium) to control a headless browser, join meetings through the web interface, capture the audio stream, and send it to a speech recognition service. We’ll use Playwright for browser automation and AssemblyAI for transcription, for its accuracy and built-in speaker diarization.

Prerequisites and Setup

Install the required dependencies:

pip install playwright assemblyai python-dotenv
playwright install chromium

(asyncio ships with Python’s standard library, so it doesn’t need to be installed. You’ll also need FFmpeg on the system path for audio capture in Step 2.)

You’ll need:

  1. Google Workspace Account – For creating and joining meetings
  2. AssemblyAI API Key – Get from assemblyai.com
  3. Google OAuth Credentials – For bot authentication

Create your .env file:

GOOGLE_EMAIL=bot@yourdomain.com
GOOGLE_PASSWORD=your_bot_password
ASSEMBLYAI_API_KEY=your_api_key

Step 1: Automate Browser Login

First, create a module to authenticate and join Google Meet:

from playwright.async_api import async_playwright
import asyncio
import os
from dotenv import load_dotenv

load_dotenv()

class GoogleMeetBot:
    def __init__(self):
        self.email = os.getenv("GOOGLE_EMAIL")
        self.password = os.getenv("GOOGLE_PASSWORD")
        self.browser = None
        self.page = None
        self.context = None

    async def initialize_browser(self):
        """Launch browser with audio capture enabled"""
        playwright = await async_playwright().start()
        self.browser = await playwright.chromium.launch(
            headless=False,  # Set True for production
            args=[
                '--use-fake-ui-for-media-stream',
                '--use-fake-device-for-media-stream',
                '--no-sandbox',
                '--disable-setuid-sandbox'
            ]
        )
        self.context = await self.browser.new_context(
            permissions=['microphone', 'camera'],
            viewport={'width': 1280, 'height': 720}
        )
        self.page = await self.context.new_page()
        print("Browser initialized")

    async def login_google(self):
        """Authenticate with Google account"""
        await self.page.goto('https://accounts.google.com')
        # Enter email
        await self.page.fill('input[type="email"]', self.email)
        await self.page.click('#identifierNext')
        await self.page.wait_for_timeout(2000)
        # Enter password
        await self.page.fill('input[type="password"]', self.password)
        await self.page.click('#passwordNext')
        await self.page.wait_for_timeout(3000)
        print("Logged in successfully")

    async def join_meeting(self, meeting_url):
        """Join a Google Meet meeting"""
        await self.page.goto(meeting_url)
        await self.page.wait_for_timeout(3000)
        # Toggle camera and microphone off before joining
        try:
            await self.page.click('button[aria-label*="camera"]')
            await self.page.click('button[aria-label*="microphone"]')
        except Exception:
            pass  # Controls may already be off or use different labels
        # Click join button
        await self.page.click('button:has-text("Ask to join")')
        await self.page.wait_for_timeout(5000)
        print(f"Joined meeting: {meeting_url}")
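Before handing a URL to join_meeting(), a quick sanity check avoids navigating the browser to a dead link. Note the 3-4-3 letter pattern below reflects the common format of Meet links, not a documented contract from Google, so treat this as a rough filter:

```python
import re

# Meet codes usually look like abc-defg-hij (an undocumented convention)
MEET_URL_RE = re.compile(r"^https://meet\.google\.com/[a-z]{3}-[a-z]{4}-[a-z]{3}$")

def is_valid_meet_url(url: str) -> bool:
    """Rough sanity check for a Google Meet meeting URL."""
    return bool(MEET_URL_RE.match(url))
```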

Step 2: Capture Audio Streams

Implement audio capture using browser audio APIs:

import subprocess

class AudioStreamCapture:
    def __init__(self, output_file="meet_audio.wav"):
        self.output_file = output_file
        self.is_recording = False
        self.process = None

    def start_capture(self):
        """Start capturing system audio using FFmpeg"""
        self.is_recording = True
        # FFmpeg command for audio capture; -y (overwrite) must come
        # before the output filename
        ffmpeg_cmd = [
            'ffmpeg',
            '-f', 'pulse',  # Use 'avfoundation' on macOS
            '-i', 'default',
            '-acodec', 'pcm_s16le',
            '-ar', '16000',
            '-ac', '1',
            '-y',
            self.output_file
        ]
        self.process = subprocess.Popen(
            ffmpeg_cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE
        )
        print(f"Audio capture started: {self.output_file}")

    def stop_capture(self):
        """Stop audio recording"""
        if self.process:
            self.process.terminate()
            self.process.wait()
            self.is_recording = False
            print("Audio capture stopped")

    def get_audio_file(self):
        """Return the path to recorded audio"""
        return self.output_file
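The FFmpeg input backend differs by platform, so a small helper can pick the right flags at runtime. The device names below are common defaults, not guarantees; Windows in particular needs a virtual audio capture device installed, and the name shown here is only an example:

```python
import platform
import shutil

def ffmpeg_available() -> bool:
    """Check that the ffmpeg binary is on the PATH before recording."""
    return shutil.which("ffmpeg") is not None

def ffmpeg_input_args():
    """Return platform-appropriate FFmpeg input flags (common defaults)."""
    system = platform.system()
    if system == "Darwin":
        return ["-f", "avfoundation", "-i", ":0"]  # first audio device on macOS
    if system == "Windows":
        # Assumes a virtual capture device is installed; adjust to yours
        return ["-f", "dshow", "-i", "audio=virtual-audio-capturer"]
    return ["-f", "pulse", "-i", "default"]  # Linux / PulseAudio
```

You could splice these into `ffmpeg_cmd` in `start_capture()` in place of the hard-coded `'-f', 'pulse', '-i', 'default'` pair.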

Step 3: Implement Real-Time Transcription

Connect captured audio to AssemblyAI’s transcription API:

import os
import assemblyai as aai
from datetime import datetime

class MeetTranscriber:
    def __init__(self):
        aai.settings.api_key = os.getenv("ASSEMBLYAI_API_KEY")
        self.transcripts = []
        self.current_speakers = {}

    def transcribe_file(self, audio_file):
        """Transcribe recorded audio file"""
        config = aai.TranscriptionConfig(
            speaker_labels=True,
            speakers_expected=5,
            language_code="en_us"
        )
        transcriber = aai.Transcriber()
        transcript = transcriber.transcribe(audio_file, config=config)
        if transcript.status == aai.TranscriptStatus.error:
            raise Exception(f"Transcription error: {transcript.error}")
        return transcript

    def format_transcript(self, transcript):
        """Format transcript with speakers and timestamps"""
        formatted_output = []
        formatted_output.append("Google Meet Transcript")
        formatted_output.append("=" * 60)
        formatted_output.append(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
        formatted_output.append("")
        for utterance in transcript.utterances:
            timestamp = self._format_timestamp(utterance.start)
            speaker = f"Speaker {utterance.speaker}"
            formatted_output.append(f"[{timestamp}] {speaker}:")
            formatted_output.append(utterance.text)
            formatted_output.append("")
        return "\n".join(formatted_output)

    def _format_timestamp(self, milliseconds):
        """Convert milliseconds to HH:MM:SS format"""
        seconds = milliseconds // 1000
        hours = seconds // 3600
        minutes = (seconds % 3600) // 60
        secs = seconds % 60
        return f"{hours:02d}:{minutes:02d}:{secs:02d}"

    def extract_action_items(self, transcript):
        """Extract potential action items from transcript"""
        action_keywords = ['todo', 'action item', 'follow up', 'will do', 'need to']
        action_items = []
        for utterance in transcript.utterances:
            text_lower = utterance.text.lower()
            if any(keyword in text_lower for keyword in action_keywords):
                action_items.append({
                    'speaker': f"Speaker {utterance.speaker}",
                    'text': utterance.text,
                    'timestamp': self._format_timestamp(utterance.start)
                })
        return action_items

    def save_transcript(self, transcript, filename):
        """Save formatted transcript to file"""
        formatted = self.format_transcript(transcript)
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(formatted)
        # Save action items separately
        action_items = self.extract_action_items(transcript)
        if action_items:
            action_file = filename.replace('.txt', '_actions.txt')
            with open(action_file, 'w', encoding='utf-8') as f:
                f.write("Action Items & Follow-ups\n")
                f.write("=" * 60 + "\n\n")
                for item in action_items:
                    f.write(f"[{item['timestamp']}] {item['speaker']}:\n")
                    f.write(f"{item['text']}\n\n")
        print(f"Transcript saved: {filename}")
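The keyword scan inside extract_action_items is easy to verify in isolation. Here is the same logic as a standalone function operating on plain (speaker, text) pairs, useful for unit-testing the heuristic before wiring it to real transcripts:

```python
ACTION_KEYWORDS = ["todo", "action item", "follow up", "will do", "need to"]

def flag_action_items(utterances):
    """utterances: list of (speaker, text) pairs; returns the flagged pairs."""
    return [
        (speaker, text)
        for speaker, text in utterances
        if any(keyword in text.lower() for keyword in ACTION_KEYWORDS)
    ]
```

Keyword matching is a blunt instrument; expect false positives ("we don’t need to do that") and plan to refine the list against your own meetings.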

Step 4: Build the Complete Bot System

Integrate all components into a functional transcription bot:

import asyncio
import signal
from datetime import datetime

class GoogleMeetTranscriptionBot:
    def __init__(self):
        self.meet_bot = GoogleMeetBot()
        self.audio_capture = AudioStreamCapture()
        self.transcriber = MeetTranscriber()
        self.is_running = False
        self._stopped = False

    async def start(self, meeting_url):
        """Start the bot and join meeting"""
        print("Initializing Google Meet Transcription Bot...")
        # Initialize browser and login
        await self.meet_bot.initialize_browser()
        await self.meet_bot.login_google()
        # Join the meeting
        await self.meet_bot.join_meeting(meeting_url)
        # Start audio capture
        self.audio_capture.start_capture()
        self.is_running = True
        print("Bot is now recording and will transcribe on exit...")

    async def stop(self):
        """Stop bot and generate transcript"""
        if self._stopped:
            return  # Guard against being called twice
        self._stopped = True
        print("\nStopping bot and generating transcript...")
        self.is_running = False
        self.audio_capture.stop_capture()
        # Transcribe the recording
        audio_file = self.audio_capture.get_audio_file()
        print("Transcribing audio... This may take a few minutes.")
        transcript = self.transcriber.transcribe_file(audio_file)
        # Save transcript
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        output_file = f"meet_transcript_{timestamp}.txt"
        self.transcriber.save_transcript(transcript, output_file)
        # Close browser
        if self.meet_bot.browser:
            await self.meet_bot.browser.close()
        print("Transcription complete!")

    async def run(self, meeting_url, duration=None):
        """Run for the given duration, or until is_running is cleared"""
        await self.start(meeting_url)
        try:
            elapsed = 0
            # Poll once per second so the signal handler can end the run early
            while self.is_running and (duration is None or elapsed < duration):
                await asyncio.sleep(1)
                elapsed += 1
        finally:
            await self.stop()

# Main execution
async def main():
    bot = GoogleMeetTranscriptionBot()
    # Replace with your meeting URL
    meeting_url = "https://meet.google.com/abc-defg-hij"

    # Graceful shutdown: clearing the flag lets run() exit its polling loop
    # (scheduling a coroutine from inside a signal handler is unreliable)
    def signal_handler(sig, frame):
        bot.is_running = False

    signal.signal(signal.SIGINT, signal_handler)
    try:
        # Run for 60 minutes or until Ctrl+C
        await bot.run(meeting_url, duration=3600)
    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    asyncio.run(main())
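If you drive the bot from the command line, a small helper for human-friendly durations ("90s", "60m", "1h") is a convenient way to feed the `duration` parameter. This is an optional addition, not part of the bot above:

```python
import re

def parse_duration(spec: str) -> int:
    """Convert a string like '90s', '60m', or '1h' to whole seconds."""
    match = re.fullmatch(r"(\d+)([smh])", spec.strip().lower())
    if not match:
        raise ValueError(f"unrecognized duration: {spec!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value * {"s": 1, "m": 60, "h": 3600}[unit]
```

For example, `await bot.run(meeting_url, duration=parse_duration("60m"))`.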

Step 5: Add Advanced Features

Enhance your bot with sentiment analysis and summary generation:

class AdvancedTranscriber(MeetTranscriber):
    def transcribe_file(self, audio_file):
        """Transcribe with summarization and sentiment analysis enabled"""
        # These features must be requested in the config at transcription
        # time; they cannot be added to an already-completed transcript
        config = aai.TranscriptionConfig(
            speaker_labels=True,
            summarization=True,
            summary_model=aai.SummarizationModel.informative,
            summary_type=aai.SummarizationType.bullets,
            sentiment_analysis=True
        )
        transcriber = aai.Transcriber()
        transcript = transcriber.transcribe(audio_file, config=config)
        if transcript.status == aai.TranscriptStatus.error:
            raise Exception(f"Transcription error: {transcript.error}")
        return transcript

    def generate_summary(self, transcript):
        """Return the bullet-point meeting summary, if one was generated"""
        return getattr(transcript, 'summary', None)

    def analyze_sentiment(self, transcript):
        """Collect per-sentence sentiment results from the transcript"""
        sentiments = []
        for result in transcript.sentiment_analysis or []:
            sentiments.append({
                'timestamp': self._format_timestamp(result.start),
                'speaker': f"Speaker {result.speaker}",
                'sentiment': result.sentiment,
                'text': result.text
            })
        return sentiments

Deployment and Production Tips

Use a dedicated Google Workspace account for your bot to avoid interrupting personal meetings. Deploy on a cloud VM with sufficient resources—at minimum 2GB RAM and 2 CPU cores. Implement webhook listeners to automatically join scheduled meetings.
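As a starting point for the webhook listener mentioned above, here is a standard-library-only sketch. The /join endpoint and the {"meeting_url": ...} payload shape are assumptions for illustration, not an existing Google Meet API; in production you would validate the sender and enqueue the URL for a worker that runs the bot:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_join_request(body: bytes):
    """Return the meeting URL from a JSON webhook body, or None."""
    try:
        return json.loads(body or b"{}").get("meeting_url")
    except json.JSONDecodeError:
        return None

class JoinHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/join":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        url = parse_join_request(self.rfile.read(length))
        # In production: authenticate the caller, then enqueue `url`
        # for a worker process that launches the transcription bot.
        self.send_response(202 if url else 400)
        self.end_headers()

# To run: HTTPServer(("0.0.0.0", 8080), JoinHandler).serve_forever()
```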

Store transcripts in cloud storage like Google Cloud Storage or AWS S3 for team access. Add monitoring with services like Sentry to track errors and bot performance. Implement retry logic for network failures and meeting access issues.

For security, encrypt stored transcripts and use environment variables for all credentials. Consider implementing role-based access control if multiple team members need transcript access.

Your Google Meet transcription bot now automatically joins meetings, captures conversations, and generates speaker-labeled transcripts with action items and sentiment analysis.

Conclusion

Building a custom Google Meet transcription bot provides full control over your meeting documentation pipeline and data privacy. However, managing browser automation, audio capture, and multiple API integrations requires ongoing maintenance and infrastructure management.

If you prefer a ready-to-use solution, consider Meetstream.ai API, which provides enterprise-grade transcription for Google Meet, Zoom, and Microsoft Teams without the complexity of building and maintaining your own bot infrastructure.
