How to Build an Audio Transcription Bot for Microsoft Teams

Microsoft Teams powers collaboration for millions of organizations worldwide, yet capturing meeting discussions remains a manual, time-consuming process. Building an automated transcription bot eliminates this burden by joining meetings programmatically, recording conversations, and generating accurate transcripts. This guide demonstrates how to create a production-ready Teams transcription bot using Microsoft’s Bot Framework and Azure services.

Understanding Teams Bot Architecture

Microsoft Teams provides a robust Bot Framework that allows bots to join meetings as participants. Your bot will register with Azure Bot Service, authenticate using Microsoft Graph API, join meetings through the Teams client, capture audio streams, and process them through Azure Cognitive Services or third-party transcription APIs.

Prerequisites and Azure Setup

Install the required dependencies:

pip install botbuilder-core botbuilder-schema azure-cognitiveservices-speech msal requests python-dotenv aiohttp

Create an Azure Bot and register your application:

  1. Register an app in Azure Active Directory
  2. Create an Azure Bot resource
  3. Enable Microsoft Teams channel
  4. Configure bot permissions for meetings

Set up your .env file:

MICROSOFT_APP_ID=your_app_id
MICROSOFT_APP_PASSWORD=your_app_password
TENANT_ID=your_tenant_id
AZURE_SPEECH_KEY=your_speech_key
AZURE_SPEECH_REGION=your_region
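Before wiring up the bot, it helps to fail fast when a setting is missing rather than debugging a confusing authentication error later. A minimal sketch (the `missing_settings` helper and variable list are illustrative, not part of any SDK):

```python
import os

REQUIRED_VARS = [
    "MICROSOFT_APP_ID",
    "MICROSOFT_APP_PASSWORD",
    "TENANT_ID",
    "AZURE_SPEECH_KEY",
    "AZURE_SPEECH_REGION",
]

def missing_settings(env=os.environ):
    """Return the names of required settings that are absent or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]

if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        raise SystemExit(f"Missing required settings: {', '.join(missing)}")
    print("All required settings present")
```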

Step 1: Implement Bot Authentication

Create authentication handler for Microsoft Graph API:

import os

import msal
import requests
from dotenv import load_dotenv

load_dotenv()

class TeamsAuthProvider:
    def __init__(self):
        self.client_id = os.getenv("MICROSOFT_APP_ID")
        self.client_secret = os.getenv("MICROSOFT_APP_PASSWORD")
        self.tenant_id = os.getenv("TENANT_ID")
        self.authority = f"https://login.microsoftonline.com/{self.tenant_id}"
        self.scope = ["https://graph.microsoft.com/.default"]

    def get_access_token(self):
        """Acquire an access token for Microsoft Graph API."""
        app = msal.ConfidentialClientApplication(
            self.client_id,
            authority=self.authority,
            client_credential=self.client_secret
        )
        result = app.acquire_token_for_client(scopes=self.scope)
        if "access_token" in result:
            return result["access_token"]
        raise Exception(f"Authentication failed: {result.get('error_description')}")

    def get_app_token(self):
        """Get a Bot Framework authentication token."""
        token_url = "https://login.microsoftonline.com/botframework.com/oauth2/v2.0/token"
        data = {
            'grant_type': 'client_credentials',
            'client_id': self.client_id,
            'client_secret': self.client_secret,
            'scope': 'https://api.botframework.com/.default'
        }
        response = requests.post(token_url, data=data)
        if response.status_code == 200:
            return response.json()['access_token']
        raise Exception(f"Token acquisition failed: {response.text}")

Step 2: Create the Teams Bot Core

Build the bot that responds to Teams events and joins meetings:

import asyncio

from botbuilder.core import TurnContext
from botbuilder.core.teams import TeamsActivityHandler

class TranscriptionBot(TeamsActivityHandler):
    # TeamsActivityHandler (rather than the plain ActivityHandler)
    # provides the Teams meeting start/end event hooks used below.
    def __init__(self):
        super().__init__()
        self.auth_provider = TeamsAuthProvider()
        self.active_meetings = {}

    async def on_message_activity(self, turn_context: TurnContext):
        """Handle incoming messages."""
        text = turn_context.activity.text.strip().lower()
        if text == "join":
            await turn_context.send_activity("Please invite me to a meeting!")
        elif text == "status":
            status = f"Active transcriptions: {len(self.active_meetings)}"
            await turn_context.send_activity(status)
        else:
            await turn_context.send_activity(
                "Send 'join' to start transcription or 'status' to check active sessions"
            )

    async def on_teams_meeting_start_event(self, meeting, turn_context: TurnContext):
        """Handle the meeting start event.

        `meeting` is a MeetingStartEventDetails object carrying the id directly.
        """
        meeting_id = meeting.id
        print(f"Meeting started: {meeting_id}")
        # Join the meeting and start transcription
        await self.join_meeting(meeting_id, turn_context)

    async def on_teams_meeting_end_event(self, meeting, turn_context: TurnContext):
        """Handle the meeting end event."""
        meeting_id = meeting.id
        print(f"Meeting ended: {meeting_id}")
        # Stop transcription and generate the final transcript
        if meeting_id in self.active_meetings:
            await self.stop_transcription(meeting_id)

    async def join_meeting(self, meeting_id, turn_context):
        """Join a Teams meeting programmatically."""
        try:
            # Store meeting context
            self.active_meetings[meeting_id] = {
                'context': turn_context,
                'start_time': asyncio.get_event_loop().time()
            }
            await turn_context.send_activity("Transcription bot has joined the meeting")
            print(f"Successfully joined meeting: {meeting_id}")
        except Exception as e:
            print(f"Error joining meeting: {e}")
            await turn_context.send_activity(f"Failed to join meeting: {str(e)}")

    async def stop_transcription(self, meeting_id):
        """Drop the meeting from the active set; transcript saving is handled by the app layer."""
        self.active_meetings.pop(meeting_id, None)
        print(f"Stopped transcription for meeting: {meeting_id}")

Step 3: Implement Audio Capture and Processing

Create audio stream handler for Teams meetings:

import os

import azure.cognitiveservices.speech as speechsdk
from datetime import datetime

class TeamsAudioProcessor:
    def __init__(self):
        self.speech_key = os.getenv("AZURE_SPEECH_KEY")
        self.speech_region = os.getenv("AZURE_SPEECH_REGION")
        self.transcript_buffer = []

    def create_speech_config(self):
        """Configure Azure Speech Services."""
        speech_config = speechsdk.SpeechConfig(
            subscription=self.speech_key,
            region=self.speech_region
        )
        speech_config.speech_recognition_language = "en-US"
        speech_config.request_word_level_timestamps()
        speech_config.enable_dictation()
        return speech_config

    async def start_transcription(self, audio_stream):
        """Start real-time transcription from an audio stream.

        `audio_stream` must be a speechsdk.audio.AudioInputStream
        (for example a PushAudioInputStream fed with raw PCM frames).
        """
        speech_config = self.create_speech_config()
        # Configure audio input
        audio_config = speechsdk.audio.AudioConfig(stream=audio_stream)
        # Create a conversation transcriber for speaker identification
        conversation_transcriber = speechsdk.transcription.ConversationTranscriber(
            speech_config=speech_config,
            audio_config=audio_config
        )
        # Event handlers
        conversation_transcriber.transcribed.connect(self._on_transcribed)
        conversation_transcriber.canceled.connect(self._on_canceled)
        conversation_transcriber.session_started.connect(self._on_session_started)
        conversation_transcriber.session_stopped.connect(self._on_session_stopped)
        # Start transcription; the SDK returns a ResultFuture rather than
        # an awaitable, so block on .get() instead of awaiting it
        conversation_transcriber.start_transcribing_async().get()
        print("Transcription started")
        return conversation_transcriber

    def _on_transcribed(self, evt):
        """Handle transcribed text."""
        if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
            speaker_id = evt.result.speaker_id
            text = evt.result.text
            offset = evt.result.offset
            entry = {
                'timestamp': self._format_timestamp(offset),
                'speaker': f"Speaker {speaker_id}",
                'text': text
            }
            self.transcript_buffer.append(entry)
            print(f"[{entry['timestamp']}] {entry['speaker']}: {entry['text']}")

    def _on_canceled(self, evt):
        """Handle transcription errors."""
        details = evt.cancellation_details
        print(f"Transcription canceled: {details.reason}")
        if details.reason == speechsdk.CancellationReason.Error:
            print(f"Error details: {details.error_details}")

    def _on_session_started(self, evt):
        """Handle session start."""
        print("Transcription session started")

    def _on_session_stopped(self, evt):
        """Handle session stop."""
        print("Transcription session stopped")

    def _format_timestamp(self, offset_ticks):
        """Convert 100-nanosecond ticks to a mm:ss timestamp."""
        seconds = offset_ticks / 10_000_000
        minutes = int(seconds // 60)
        secs = int(seconds % 60)
        return f"{minutes:02d}:{secs:02d}"

    def save_transcript(self, meeting_id, filename=None):
        """Save the transcript to a file."""
        if not filename:
            timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
            filename = f"teams_transcript_{meeting_id}_{timestamp}.txt"
        with open(filename, 'w', encoding='utf-8') as f:
            f.write("Microsoft Teams Meeting Transcript\n")
            f.write("=" * 70 + "\n")
            f.write(f"Meeting ID: {meeting_id}\n")
            f.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
            f.write("=" * 70 + "\n\n")
            for entry in self.transcript_buffer:
                f.write(f"[{entry['timestamp']}] {entry['speaker']}:\n")
                f.write(f"{entry['text']}\n\n")
        print(f"Transcript saved: {filename}")
        return filename
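The Speech SDK reports result offsets in 100-nanosecond ticks, which is what `_format_timestamp` converts. A standalone version of the same conversion makes the arithmetic easy to verify:

```python
def format_timestamp(offset_ticks: int) -> str:
    """Convert Speech SDK 100-nanosecond ticks to a mm:ss label."""
    seconds = offset_ticks / 10_000_000  # 10 million ticks per second
    minutes = int(seconds // 60)
    secs = int(seconds % 60)
    return f"{minutes:02d}:{secs:02d}"

# 125 seconds of audio = 1,250,000,000 ticks
print(format_timestamp(1_250_000_000))  # 02:05
```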

Step 4: Handle Teams Meeting Media Streams

Implement media stream handling for audio capture:

import aiohttp

class TeamsMediaHandler:
    def __init__(self, bot_id, auth_provider):
        self.bot_id = bot_id
        self.auth_provider = auth_provider
        self.media_sessions = {}

    async def subscribe_to_audio(self, meeting_id, participant_id):
        """Subscribe to the audio stream of a Teams meeting."""
        token = self.auth_provider.get_access_token()
        headers = {
            'Authorization': f'Bearer {token}',
            'Content-Type': 'application/json'
        }
        # Subscribe to the audio stream
        subscription_data = {
            'resource': f'/communications/calls/{meeting_id}/audioStreams',
            'changeType': 'created,updated',
            'notificationUrl': 'https://your-bot-endpoint.com/api/notifications',
            'clientState': 'transcription-bot-secret'
        }
        async with aiohttp.ClientSession() as session:
            async with session.post(
                'https://graph.microsoft.com/v1.0/subscriptions',
                headers=headers,
                json=subscription_data
            ) as response:
                if response.status == 201:
                    subscription = await response.json()
                    self.media_sessions[meeting_id] = subscription
                    print(f"Subscribed to audio stream: {meeting_id}")
                    return subscription
                error = await response.text()
                raise Exception(f"Subscription failed: {error}")

    async def get_audio_stream(self, meeting_id):
        """Retrieve audio stream data."""
        if meeting_id not in self.media_sessions:
            raise Exception(f"No active subscription for meeting: {meeting_id}")
        token = self.auth_provider.get_access_token()
        headers = {'Authorization': f'Bearer {token}'}
        stream_url = f'https://graph.microsoft.com/v1.0/communications/calls/{meeting_id}/audioStream'
        async with aiohttp.ClientSession() as session:
            async with session.get(stream_url, headers=headers) as response:
                if response.status == 200:
                    # Read the payload before the session closes; returning
                    # response.content here would hand back a closed stream
                    return await response.read()
                error = await response.text()
                raise Exception(f"Failed to get audio stream: {error}")
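One detail the subscription call above depends on: Microsoft Graph validates a new subscription by POSTing to your `notificationUrl` with a `validationToken` query parameter, and expects that token echoed back as plain text before it will deliver any notifications. A sketch of that handshake as a pure function (the function name and return shape are illustrative, not part of the Graph SDK):

```python
from urllib.parse import parse_qs, urlparse

def handle_graph_notification(url, client_state, body=None):
    """Return (status, content_type, payload) for a Graph webhook call.

    Graph first probes the endpoint with a validationToken query
    parameter, which must be echoed back as plain text.
    """
    query = parse_qs(urlparse(url).query)
    token = query.get("validationToken")
    if token:
        return 200, "text/plain", token[0]
    # Real notifications: verify clientState before trusting the payload
    for note in (body or {}).get("value", []):
        if note.get("clientState") != client_state:
            return 202, "text/plain", ""  # ignore spoofed notifications
    return 202, "text/plain", "accepted"

status, ctype, payload = handle_graph_notification(
    "https://bot.example.com/api/notifications?validationToken=abc123",
    client_state="transcription-bot-secret",
)
print(status, payload)  # 200 abc123
```

In the aiohttp server below, this function would back the `/api/notifications` route configured in `subscription_data`.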

Step 5: Build the Complete Integration

Combine all components into a working transcription bot:

import os

from aiohttp import web
from botbuilder.core import BotFrameworkAdapter, BotFrameworkAdapterSettings
from botbuilder.schema import Activity

class TeamsTranscriptionBotApp:
    def __init__(self):
        self.settings = BotFrameworkAdapterSettings(
            app_id=os.getenv("MICROSOFT_APP_ID"),
            app_password=os.getenv("MICROSOFT_APP_PASSWORD")
        )
        self.adapter = BotFrameworkAdapter(self.settings)
        self.bot = TranscriptionBot()
        self.audio_processor = TeamsAudioProcessor()
        self.media_handler = TeamsMediaHandler(
            self.settings.app_id,
            TeamsAuthProvider()
        )

    async def messages_handler(self, req: web.Request) -> web.Response:
        """Handle incoming messages from Teams."""
        if "application/json" in req.headers.get("Content-Type", ""):
            body = await req.json()
        else:
            return web.Response(status=415)
        activity = Activity().deserialize(body)
        auth_header = req.headers.get("Authorization", "")

        async def call_bot(turn_context):
            await self.bot.on_turn(turn_context)

        await self.adapter.process_activity(activity, auth_header, call_bot)
        return web.Response(status=200)

    async def start_bot_transcription(self, meeting_id):
        """Initialize transcription for a meeting."""
        try:
            # Subscribe to the media stream
            await self.media_handler.subscribe_to_audio(meeting_id, self.settings.app_id)
            # Get the audio stream
            audio_stream = await self.media_handler.get_audio_stream(meeting_id)
            # Start transcription
            await self.audio_processor.start_transcription(audio_stream)
            print(f"Transcription active for meeting: {meeting_id}")
        except Exception as e:
            print(f"Error starting transcription: {e}")

    async def stop_bot_transcription(self, meeting_id):
        """Stop transcription and save results."""
        try:
            # Save the transcript
            filename = self.audio_processor.save_transcript(meeting_id)
            print(f"Transcription completed: {filename}")
            return filename
        except Exception as e:
            print(f"Error stopping transcription: {e}")

    def run(self, host='0.0.0.0', port=3978):
        """Start the bot web server."""
        app = web.Application()
        app.router.add_post('/api/messages', self.messages_handler)
        print(f"Bot running on {host}:{port}")
        # web.run_app blocks until the server shuts down
        web.run_app(app, host=host, port=port)

# Main execution
if __name__ == "__main__":
    bot_app = TeamsTranscriptionBotApp()
    bot_app.run()

Deployment and Configuration

Deploy your bot to Azure App Service or Azure Container Instances for production use. Configure your bot endpoint in Azure Bot Service and update the messaging endpoint URL. Enable the Teams channel and configure meeting event subscriptions.

Set up ngrok for local development testing:

ngrok http 3978

Update your bot’s messaging endpoint to the ngrok URL in Azure Portal.

For production, implement secure storage for transcripts using Azure Blob Storage. Add monitoring with Application Insights to track bot performance and errors. Configure auto-scaling to handle multiple concurrent meetings.
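As a sketch of the Blob Storage idea, transcripts can be uploaded with the `azure-storage-blob` package; the container name and date-partitioned path scheme below are assumptions, not requirements:

```python
from datetime import datetime, timezone

def transcript_blob_name(meeting_id, when):
    """Build a date-partitioned blob path so transcripts are easy to browse."""
    return f"transcripts/{when:%Y/%m/%d}/{meeting_id}.txt"

def upload_transcript(filename, meeting_id, connection_string):
    """Upload a saved transcript file to Azure Blob Storage."""
    from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob
    service = BlobServiceClient.from_connection_string(connection_string)
    blob = service.get_blob_client(
        container="meeting-transcripts",  # assumed container name
        blob=transcript_blob_name(meeting_id, datetime.now(timezone.utc)),
    )
    with open(filename, "rb") as data:
        blob.upload_blob(data, overwrite=True)

print(transcript_blob_name("meeting42", datetime(2024, 5, 17, tzinfo=timezone.utc)))
# transcripts/2024/05/17/meeting42.txt
```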

Security Best Practices

Store all credentials in Azure Key Vault instead of environment variables. Implement rate limiting to prevent abuse. Use managed identities for Azure resource access. Enable encryption for stored transcripts and implement role-based access control for transcript retrieval.
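Rate limiting can be as simple as a token bucket kept per caller. This sketch (not tied to any framework) allows a burst of up to `capacity` requests and refills at `rate` tokens per second; the injectable `clock` makes it testable:

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter."""
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def allow(self):
        """Consume one token if available; return whether the call is allowed."""
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
print(all(bucket.allow() for _ in range(10)))  # True: burst within capacity
print(bucket.allow())  # False: bucket drained, refill not yet elapsed
```

In the bot, one bucket per tenant or per user id in front of `messages_handler` keeps a single noisy caller from starving other meetings.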

Your Microsoft Teams transcription bot now automatically joins meetings, captures audio, generates speaker-labeled transcripts, and integrates seamlessly with the Teams ecosystem.

Conclusion

Building a custom Microsoft Teams transcription bot gives you complete control over meeting documentation and data handling, with deep integration into your Azure infrastructure. However, managing bot registration, media streams, Azure services, and compliance requirements demands significant development and maintenance effort.

If you want enterprise-grade transcription without the complexity, consider Meetstream.ai API, which provides ready-to-use transcription for Microsoft Teams, Zoom, and Google Meet with simple API integration.
