Best Practices for Building Meeting Notetakers

Introduction

Building a robust meeting notetaker requires careful consideration of accuracy, latency, speaker identification, and real-time capabilities. This guide addresses common questions and provides practical solutions for both post-call and live meeting transcription scenarios.

Why AssemblyAI for Meeting Notetakers?

AssemblyAI stands out as the premier choice for meeting notetakers with several key advantages:

Industry-Leading Accuracy with Pre-recorded Audio

93.3%+ transcription accuracy ensures reliable meeting documentation
2.9% speaker diarization error rate for precise “who said what” attribution
Speech Understanding integration for intelligent post-processing and insights
Keyterms prompting lets you provide meeting context to improve transcription accuracy

Streaming with Universal-3.5 Pro

As meeting notetakers evolve toward real-time capabilities, AssemblyAI’s Universal-3.5 Pro Streaming model (universal-3-5-pro) offers significant benefits:

Speaker diarization available for both pre-recorded and streaming transcription
Ultra-low latency (~300ms) enables live transcription without delays
Format turns feature provides structured, readable output in real-time
Keyterms prompting lets you provide meeting context to improve transcription accuracy

End-to-End Voice AI Platform

Unlike fragmented solutions, AssemblyAI provides a unified API for:

Transcription with speaker diarization
Automatic language detection and code switching
Boosting accuracy via meeting context with keyterms prompt
Speech Understanding tasks like speaker identification, translation, and transcript styling
Post-processing workflows with custom prompting - from summarization to completely custom workflows
Real-time and batch processing of pre-recorded audio in a single platform

When Should I Use Pre-recorded vs Streaming for Meeting Notetakers?

Understanding when to use pre-recorded versus streaming speech-to-text is critical for building the right meeting notetaker.

Pre-recorded Speech-to-text

Post-call analysis - Meeting already happened, you have the full recording

Highest accuracy needed - Pre-recorded models have higher accuracy (93.3%+)
Speaker diarization is critical - Pre-recorded has 2.9% speaker error rate
Broad language support - Need any of 99+ languages
Advanced features required - Summarization, sentiment analysis, entity detection, PII redaction, speaker identification
Batch processing - Processing multiple recordings at once
Quality over speed - Can wait seconds/minutes for perfect results

Best for: Zoom/Teams/Meet recording uploads, compliance, documentation, post-call summaries, searchable archives

Streaming Speech-to-text

Use streaming when you need to display a live transcript to users as they are speaking. With Universal-3.5 Pro Streaming, accuracy is closer to pre-recorded, but pre-recorded will always be the most accurate option.

Live meetings - Transcribing as the meeting happens

Real-time captions - Displaying subtitles/captions to participants during calls
Immediate feedback - Need transcription within ~300ms
Interactive features - Live note-taking, real-time keyword detection, action item alerts
No recording available - Processing live audio only

Best for: Live captions, real-time note-taking apps, accessibility features, live keyword alerts

Streaming is billed per session. Billing is based on the total duration that your WebSocket connection stays open, not on the amount of audio you send. For long-running meetings, make sure to terminate sessions when the meeting ends to avoid being billed for idle time. See Billing and pricing for details.
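Because billing follows connection time rather than audio sent, it can help to track when speech was last transcribed and end the session after a long quiet period. The following is a minimal sketch, assuming a hypothetical `IdleWatchdog` helper and an illustrative 120-second limit; neither is part of the AssemblyAI API.

```python
import time


class IdleWatchdog:
    """Tracks recent transcript activity so an idle session can be closed.

    Hypothetical helper for illustration; the default limit is an assumption.
    """

    def __init__(self, idle_limit_s: float = 120.0, clock=time.monotonic):
        self.idle_limit_s = idle_limit_s
        self._clock = clock
        self._last_activity = clock()

    def mark_activity(self) -> None:
        """Call this whenever a non-empty transcript arrives."""
        self._last_activity = self._clock()

    def should_terminate(self) -> bool:
        """True once no activity has been seen for idle_limit_s seconds."""
        return self._clock() - self._last_activity >= self.idle_limit_s
```

When `should_terminate()` returns true, the application would send the `{"type": "Terminate"}` message used in the streaming example later in this guide and then close the socket.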

Hybrid Approach (Recommended)

Many successful meeting notetakers use both pre-recorded and streaming speech-to-text:

Streaming during the call - Provide live captions and real-time notes to participants
Pre-recorded after the call - Generate high-quality transcript with speaker labels, summary, and insights

This gives users immediate value during meetings while providing comprehensive documentation afterward. Example workflow:

User joins meeting → Start streaming for live captions
Meeting ends → Upload recording to pre-recorded API for final transcript with speaker names
Generate meeting summary, action items, and searchable archive from pre-recorded transcript

What Languages and Features for a Meeting Notetaker?

Pre-Recorded Meetings

For post-call analysis, AssemblyAI supports:

Languages:

99+ languages supported
Automatic Language Detection to route to the most spoken language
Code Switching to preserve changes in speech between languages

Core Features:

Speaker diarization (1-10 speakers by default, expandable to any min/max)
Multichannel audio support (each channel = one speaker)
Automatic formatting, punctuation, and capitalization
Keyterms prompting for boosting domain-specific terms

Speech Understanding Models:

Summarization for meeting recaps
Sentiment analysis for meeting tone assessment
Entity detection for extracting key information
Speaker identification to map generic labels to actual names/roles
Translation between 99+ languages

Real-Time Streaming

For live meeting transcription:

Languages:

English-only model (default)
Multilingual model supporting English, Spanish, French, German, Portuguese, and Italian

Streaming (Universal-3.5 Pro Streaming)

Speaker diarization for identifying who is speaking
Partial and final transcripts for responsive UI
Format turns for structured, readable output
Keyterms prompt for contextual accuracy

See the Universal-3.5 Pro Streaming documentation for full details.

How Can I Get Started Building a Post-Call Meeting Notetaker?

Here’s a complete example implementing pre-recorded transcription with all essential features:

import assemblyai as aai
import asyncio
from typing import Dict
from assemblyai.types import (
    SpeakerOptions,
    LanguageDetectionOptions,
    PIIRedactionPolicy,
    PIISubstitutionPolicy,
)

# Configure API key
aai.settings.api_key = "your_api_key_here"

async def transcribe_meeting_async(audio_source: str) -> Dict:
    """
    Asynchronously transcribe a meeting recording with full features

    Args:
        audio_source: Either a local file path or publicly accessible URL
    """
    # Configure comprehensive meeting analysis
    config = aai.TranscriptionConfig(
        # Speaker diarization
        speaker_labels=True,
        speakers_expected=None,  # Use if you know exact number from Zoom/Meet/Teams
        speaker_options=SpeakerOptions(
            min_speakers_expected=2,
            max_speakers_expected=10  # Set a bit higher than expected; too high can cause over-splitting
        ),
        multichannel=False,  # Set to True if audio has separate channel per speaker

        # Language detection
        language_detection=True,  # Auto-detect the most used language
        language_detection_options=LanguageDetectionOptions(
            code_switching=True,  # Preserve language switches
            code_switching_confidence_threshold=0.5,
        ),

        # Punctuation and formatting
        punctuate=True,
        format_text=True,

        # Boost accuracy of meeting-specific vocabulary
        keyterms_prompt=["quarterly", "KPI", "roadmap", "deliverables"],

        # Speech Understanding - commonly used models
        summarization=True,
        sentiment_analysis=True,
        entity_detection=True,
        redact_pii=True,
        redact_pii_policies=[
            PIIRedactionPolicy.person_name,
            PIIRedactionPolicy.organization,
            PIIRedactionPolicy.occupation,
        ],
        redact_pii_sub=PIISubstitutionPolicy.hash,
        redact_pii_audio=True
    )

    # Create transcriber
    transcriber = aai.Transcriber()

    try:
        # Submit transcription job
        transcript = await asyncio.to_thread(
            transcriber.transcribe,
            audio_source,
            config=config
        )

        # Check status
        if transcript.status == aai.TranscriptStatus.error:
            raise Exception(f"Transcription failed: {transcript.error}")

        # Process speaker-labeled utterances
        print("\n=== SPEAKER-LABELED TRANSCRIPT ===\n")

        for utterance in transcript.utterances:
            # Format timestamp
            start_time = utterance.start / 1000  # Convert to seconds
            end_time = utterance.end / 1000

            # Print formatted utterance
            print(f"[{start_time:.1f}s - {end_time:.1f}s] Speaker {utterance.speaker}:")
            print(f"  {utterance.text}")
            print(f"  Confidence: {utterance.confidence:.2%}\n")

        # Print summary data
        print("\n=== MEETING SUMMARY ===\n")
        print({
            "id": transcript.id,
            "status": transcript.status,
            "duration": transcript.audio_duration,
            "speaker_count": len(set(u.speaker for u in transcript.utterances)),
            "word_count": len(transcript.words) if transcript.words else 0,
            "detected_language": transcript.language_code if hasattr(transcript, 'language_code') else None,
            "summary": transcript.summary,
        })

        return {
            "transcript": transcript,
            "utterances": transcript.utterances,
            "summary": transcript.summary,
        }

    except Exception as e:
        print(f"Error during transcription: {e}")
        raise

async def main():
    """
    Example usage with error handling
    """
    # Use either local file OR URL (not both)
    audio_source = "https://assembly.ai/wildfires.mp3"  # Or "path/to/recording.mp3"

    try:
        result = await transcribe_meeting_async(audio_source)

        # Additional processing
        print(f"\nTotal speakers identified: {len(set(u.speaker for u in result['utterances']))}")
        print(f"Meeting duration: {result['transcript'].audio_duration} seconds")

    except Exception as e:
        print(f"Failed to process meeting: {e}")

if __name__ == "__main__":
    asyncio.run(main())
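
Once utterances are available, consecutive turns by the same speaker can be merged and timestamps rendered as mm:ss to produce readable notes. The following is a minimal sketch, assuming the utterances have first been converted to plain dicts with `speaker`, `text`, and `start` (milliseconds) keys; the helper names are hypothetical.

```python
from typing import Dict, List


def format_timestamp(ms: int) -> str:
    """Render a millisecond offset as mm:ss."""
    total_seconds = ms // 1000
    return f"{total_seconds // 60:02d}:{total_seconds % 60:02d}"


def merge_consecutive(utterances: List[Dict]) -> List[Dict]:
    """Join back-to-back utterances from the same speaker into one entry."""
    merged: List[Dict] = []
    for u in utterances:
        if merged and merged[-1]["speaker"] == u["speaker"]:
            merged[-1]["text"] += " " + u["text"]
        else:
            merged.append({"speaker": u["speaker"], "text": u["text"], "start": u["start"]})
    return merged


def render_notes(utterances: List[Dict]) -> str:
    """Produce one '[mm:ss] Speaker X: text' line per merged utterance."""
    return "\n".join(
        f"[{format_timestamp(u['start'])}] Speaker {u['speaker']}: {u['text']}"
        for u in merge_consecutive(utterances)
    )
```

With the SDK objects from the example above, the input could be built as `[{"speaker": u.speaker, "text": u.text, "start": u.start} for u in transcript.utterances]`.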

How Can I Get Started Building a During-Call Live Meeting Notetaker?

Here’s a complete example for real-time streaming transcription with meeting-optimized settings:

# pip install pyaudio websocket-client
import pyaudio
import websocket
import json
import threading
import time
from urllib.parse import urlencode
from datetime import datetime

# --- Configuration ---
YOUR_API_KEY = "your_api_key"

# Keyterms to improve recognition accuracy
KEYTERMS = [
    "Alice Johnson",
    "Bob Smith",
    "Carol Davis",
    "quarterly review",
    "action items",
    "follow up",
    "deadline",
    "budget"
]

# MEETING NOTETAKER CONFIGURATION (different from voice agents!)
CONNECTION_PARAMS = {
    "sample_rate": 16000,
    "speech_model": "universal-3-5-pro",
    "format_turns": True,  # ALWAYS TRUE for meetings - users need readable text

    # Meeting-optimized turn detection (wait longer than voice agents)
    # universal-3-5-pro defaults: min_turn_silence=100ms, max_turn_silence=1000ms
    "min_turn_silence": 560,  # Wait longer for natural pauses (voice agents use ~100ms)
    "max_turn_silence": 2000,  # Allow thinking pauses

    # Keyterms for accuracy - pass each term as a separate query parameter
    "keyterms_prompt": KEYTERMS,
}

API_ENDPOINT_BASE_URL = "wss://streaming.assemblyai.com/v3/ws"
API_ENDPOINT = f"{API_ENDPOINT_BASE_URL}?{urlencode(CONNECTION_PARAMS, doseq=True)}"

# Audio Configuration
FRAMES_PER_BUFFER = 800  # 50ms of audio
SAMPLE_RATE = CONNECTION_PARAMS["sample_rate"]
CHANNELS = 1
FORMAT = pyaudio.paInt16

# Global variables
audio = None
stream = None
ws_app = None
audio_thread = None
stop_event = threading.Event()
transcript_buffer = []


def on_open(ws):
    """Called when the WebSocket connection is established."""
    print("=" * 80)
    print(f"[{datetime.now().strftime('%H:%M:%S')}] Meeting transcription started")
    print(f"Connected to: {API_ENDPOINT_BASE_URL}")
    print(f"Keyterms configured: {', '.join(KEYTERMS)}")
    print("=" * 80)
    print("\nSpeak into your microphone. Press Ctrl+C to stop.\n")

    def stream_audio():
        """Stream audio from microphone to WebSocket"""
        global stream
        while not stop_event.is_set():
            try:
                audio_data = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
                ws.send(audio_data, websocket.ABNF.OPCODE_BINARY)
            except Exception as e:
                if not stop_event.is_set():
                    print(f"Error streaming audio: {e}")
                break

    global audio_thread
    audio_thread = threading.Thread(target=stream_audio)
    audio_thread.daemon = True
    audio_thread.start()


def on_message(ws, message):
    """Handle incoming messages from AssemblyAI"""
    try:
        data = json.loads(message)
        msg_type = data.get("type")

        # Uncomment to see full JSON for debugging:
        # print("=" * 80)
        # print(json.dumps(data, indent=2, ensure_ascii=False))
        # print("=" * 80)
        # print()

        if msg_type == "Begin":
            session_id = data.get("id", "N/A")
            print(f"[SESSION] Started - ID: {session_id}\n")

        elif msg_type == "Turn":
            end_of_turn = data.get("end_of_turn", False)
            transcript = data.get("transcript", "")
            turn_order = data.get("turn_order", 0)
            end_of_turn_confidence = data.get("end_of_turn_confidence", 0.0)

            # FOR MEETING NOTETAKERS: Show partials for responsive UI
            if not end_of_turn and transcript:
                print(f"\r[LIVE] {transcript}", end="", flush=True)

            # FOR MEETING NOTETAKERS: Use formatted finals for readable display
            # (Unlike voice agents which should use utterance for speed)
            if end_of_turn and transcript:
                timestamp = datetime.now().strftime('%H:%M:%S')
                print(f"\n[{timestamp}] {transcript}")
                print(f"           Turn: {turn_order} | Confidence: {end_of_turn_confidence:.2%}")

                # Detect action items
                transcript_lower = transcript.lower()
                if any(term in transcript_lower for term in ["action item", "follow up", "deadline", "assigned to", "todo"]):
                    print("           ⚠️  ACTION ITEM DETECTED!")

                # Store final transcript
                transcript_buffer.append({
                    "timestamp": timestamp,
                    "text": transcript,
                    "turn_order": turn_order,
                    "confidence": end_of_turn_confidence,
                    "type": "final"
                })
                print()

        elif msg_type == "Termination":
            audio_duration = data.get("audio_duration_seconds", 0)
            print(f"\n[SESSION] Terminated - Duration: {audio_duration}s")
            save_transcript()

        elif msg_type == "Error":
            error_msg = data.get("error", "Unknown error")
            print(f"\n[ERROR] {error_msg}")

    except json.JSONDecodeError as e:
        print(f"Error decoding message: {e}")
    except Exception as e:
        print(f"Error handling message: {e}")


def on_error(ws, error):
    """Called when a WebSocket error occurs."""
    print(f"\n[WEBSOCKET ERROR] {error}")
    stop_event.set()


def on_close(ws, close_status_code, close_msg):
    """Called when the WebSocket connection is closed."""
    print(f"\n[WEBSOCKET] Disconnected - Status: {close_status_code}, Message: {close_msg}")

    global stream, audio
    stop_event.set()

    # Clean up audio stream
    if stream:
        if stream.is_active():
            stream.stop_stream()
        stream.close()
        stream = None
    if audio:
        audio.terminate()
        audio = None
    if audio_thread and audio_thread.is_alive():
        audio_thread.join(timeout=1.0)


def save_transcript():
    """Save the transcript to a file"""
    if not transcript_buffer:
        print("No transcript to save.")
        return

    filename = f"meeting_transcript_{datetime.now().strftime('%Y%m%d_%H%M%S')}.txt"

    with open(filename, "w", encoding="utf-8") as f:
        f.write("Meeting Transcript\n")
        f.write(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
        f.write(f"Keyterms: {', '.join(KEYTERMS)}\n")
        f.write("=" * 80 + "\n\n")

        for entry in transcript_buffer:
            f.write(f"[{entry['timestamp']}] {entry['text']}\n")
            f.write(f"Confidence: {entry['confidence']:.2%}\n\n")

    print(f"Transcript saved to: {filename}")


def run():
    """Main function to run the streaming transcription"""
    global audio, stream, ws_app

    # Initialize PyAudio
    audio = pyaudio.PyAudio()

    # Open microphone stream
    try:
        stream = audio.open(
            input=True,
            frames_per_buffer=FRAMES_PER_BUFFER,
            channels=CHANNELS,
            format=FORMAT,
            rate=SAMPLE_RATE,
        )
        print("Microphone stream opened successfully.")
    except Exception as e:
        print(f"Error opening microphone stream: {e}")
        if audio:
            audio.terminate()
        return

    # Create WebSocketApp
    ws_app = websocket.WebSocketApp(
        API_ENDPOINT,
        header={"Authorization": YOUR_API_KEY},
        on_open=on_open,
        on_message=on_message,
        on_error=on_error,
        on_close=on_close,
    )

    # Run WebSocketApp in a separate thread
    ws_thread = threading.Thread(target=ws_app.run_forever)
    ws_thread.daemon = True
    ws_thread.start()

    try:
        # Keep main thread alive until interrupted
        while ws_thread.is_alive():
            time.sleep(0.1)
    except KeyboardInterrupt:
        print("\n\nCtrl+C received. Stopping transcription...")
        stop_event.set()

        # Send termination message to the server
        if ws_app and ws_app.sock and ws_app.sock.connected:
            try:
                terminate_message = {"type": "Terminate"}
                ws_app.send(json.dumps(terminate_message))
                time.sleep(1)
            except Exception as e:
                print(f"Error sending termination message: {e}")

        if ws_app:
            ws_app.close()

        ws_thread.join(timeout=2.0)

    finally:
        # Final cleanup
        if stream and stream.is_active():
            stream.stop_stream()
        if stream:
            stream.close()
        if audio:
            audio.terminate()
        print("Cleanup complete. Exiting.")


if __name__ == "__main__":
    run()

These settings wait longer before ending turns to accommodate natural conversation pauses and ensure readable formatted text for display. You can tweak these settings to get the best results for your notetaker.
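
For a display layer, it can be convenient to separate the partial/final bookkeeping from the WebSocket callbacks. The following is a minimal sketch of a hypothetical `TurnBuffer` class that consumes decoded `Turn` messages using the `transcript`, `end_of_turn`, and `turn_order` fields from the example above. It assumes that a later final message with the same `turn_order` should replace the earlier one.

```python
from typing import Dict, List


class TurnBuffer:
    """Keeps the latest partial text and the finalized turns of a session."""

    def __init__(self) -> None:
        self.partial: str = ""
        self._finals: Dict[int, str] = {}

    def handle(self, message: Dict) -> None:
        """Consume one decoded server message; non-Turn messages are ignored."""
        if message.get("type") != "Turn":
            return
        text = message.get("transcript", "")
        if message.get("end_of_turn"):
            if text:
                # Keyed by turn_order so a repeated final replaces the earlier one
                self._finals[message.get("turn_order", 0)] = text
            self.partial = ""
        else:
            self.partial = text

    def finals(self) -> List[str]:
        """Finalized turns in turn order."""
        return [self._finals[k] for k in sorted(self._finals)]
```

A UI could render `finals()` as the settled transcript and show `partial` as a live line beneath it.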

How Do I Handle Multichannel Meeting Audio?

Many meeting platforms (Zoom, Teams, Google Meet) can record each participant on separate audio channels. This dramatically improves speaker identification accuracy.

For Pre-recorded Meetings

config = aai.TranscriptionConfig(
    multichannel=True,  # Enable when each speaker is on different channel
    speaker_labels=False,  # Disable - channels already separate speakers
    # Other settings...
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(audio_file, config=config)

# Access per-channel transcripts
for channel, channel_transcript in enumerate(transcript.channels):
    print(f"\n=== Channel {channel} ===")
    print(channel_transcript.text)

When to use multichannel:

Zoom local recordings with “Record separate audio file for each participant” enabled
Professional podcast recordings with individual microphones
Conference systems with dedicated channels per participant
Phone calls with caller and callee on separate channels

Benefits:

Perfect speaker separation - No diarization errors
No speaker confusion or overlap issues
Faster processing time - Diarization not needed
Higher accuracy - Model processes clean single-speaker audio

How to enable in meeting platforms:

Zoom: Settings → Recording → Advanced → “Record a separate audio file for each participant”
Teams: Requires third-party recording solutions like Recall.ai
Google Meet: Requires third-party recording solutions like Recall.ai
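
When a platform gives you one audio file per participant, one option is to combine the files into a single multichannel file before upload. The following is a minimal sketch using only the Python standard library. It assumes each input is a 16-bit mono PCM WAV at the same sample rate and that all recordings start at the same moment; `interleave_mono_wavs` is a hypothetical helper, not part of the AssemblyAI SDK.

```python
import array
import wave
from typing import List


def interleave_mono_wavs(mono_paths: List[str], out_path: str) -> None:
    """Write one multichannel WAV with a channel per 16-bit mono input file."""
    if not mono_paths:
        raise ValueError("at least one input file is required")

    tracks = []
    rate = None
    for path in mono_paths:
        with wave.open(path, "rb") as wf:
            if wf.getnchannels() != 1 or wf.getsampwidth() != 2:
                raise ValueError(f"{path} must be 16-bit mono PCM")
            if rate is None:
                rate = wf.getframerate()
            elif wf.getframerate() != rate:
                raise ValueError("all inputs must share one sample rate")
            samples = array.array("h")
            samples.frombytes(wf.readframes(wf.getnframes()))
        tracks.append(samples)

    n_channels = len(tracks)
    n_frames = max(len(t) for t in tracks)
    # Zero-filled output so shorter tracks are padded with silence
    out = array.array("h", bytes(2 * n_frames * n_channels))
    for ch, track in enumerate(tracks):
        # Every n_channels-th sample, starting at offset ch, belongs to channel ch
        out[ch : len(track) * n_channels : n_channels] = track

    with wave.open(out_path, "wb") as wf:
        wf.setnchannels(n_channels)
        wf.setsampwidth(2)
        wf.setframerate(rate)
        wf.writeframes(out.tobytes())
```

The resulting file can then be transcribed with `multichannel=True` as shown above, with channel order matching the order of `mono_paths`.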

For Streaming Meetings

For real-time multichannel audio, create separate streaming sessions per channel:

import asyncio
import json
from urllib.parse import urlencode

import websockets

API_KEY = "your_api_key"

class ChannelTranscriber:
    def __init__(self, channel_id: int, speaker_name: str):
        self.channel_id = channel_id
        self.speaker_name = speaker_name
        self.connection_params = {
            "sample_rate": 16000,
            "speech_model": "universal-3-5-pro",
            "format_turns": True,
        }

    async def transcribe_channel(self, audio_stream):
        """Transcribe a single audio channel"""
        url = f"wss://streaming.assemblyai.com/v3/ws?{urlencode(self.connection_params)}"

        # If you're using `websockets` version 13.0 or later, use `additional_headers` parameter. For older versions (< 13.0), use `extra_headers` instead.
        async with websockets.connect(url, additional_headers={"Authorization": API_KEY}) as ws:
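            async def send_audio():
                # Send audio from this channel only, then ask the server to end the session
                async for audio_chunk in audio_stream:
                    await ws.send(audio_chunk)
                await ws.send(json.dumps({"type": "Terminate"}))

            async def receive_transcripts():
                # Print finalized turns until the server confirms termination
                async for message in ws:
                    data = json.loads(message)
                    if data.get("type") == "Turn" and data.get("end_of_turn"):
                        print(f"{self.speaker_name}: {data['transcript']}")
                    elif data.get("type") == "Termination":
                        break

            # Run both directions concurrently so transcripts arrive during the meeting
            await asyncio.gather(send_audio(), receive_transcripts())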

# Create transcriber for each channel
async def transcribe_multichannel_meeting(channel_audio_streams):
    transcribers = [
        ChannelTranscriber(0, "Alice"),
        ChannelTranscriber(1, "Bob"),
    ]

    # Run all channels concurrently
    await asyncio.gather(*[
        t.transcribe_channel(stream)
        for t, stream in zip(transcribers, channel_audio_streams)
    ])

See our multichannel streaming guide for complete implementation details.
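
With one session per channel, each channel reports its turns independently, so a combined transcript needs to be ordered afterward. The following is a minimal sketch, assuming the application records its own timestamp (seconds since the meeting started) for every finalized turn; the `Turn` tuple layout and `merge_channels` are hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple

# (seconds since meeting start, speaker name, text); layout is an assumption
Turn = Tuple[float, str, str]


def merge_channels(*channels: Iterable[Turn]) -> List[Turn]:
    """Merge per-channel turn lists, each already time-ordered, into one timeline."""
    return list(heapq.merge(*channels, key=lambda turn: turn[0]))
```

For example, `merge_channels(alice_turns, bob_turns)` returns a single list sorted by the first tuple element.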

How Should I Handle Pre-recorded Transcription in Production?

Choose the right approach based on your application’s needs:

Option 1: Simple Blocking Call

# Waits for the finished transcript; asyncio.to_thread keeps the event loop free meanwhile
transcript = await asyncio.to_thread(transcriber.transcribe, audio_url, config=config)

Pros:

Simple, straightforward code
Good for low volume applications
Easy to understand and debug

Cons:

Ties up resources while waiting
Not suitable for high volume
Cannot process multiple files simultaneously

Best for: Personal projects, prototypes, low-traffic applications

Option 2: Webhook Callbacks (Production Recommended)

config = aai.TranscriptionConfig(
    webhook_url="https://your-app.com/webhooks/assemblyai",
    webhook_auth_header_name="X-Webhook-Secret",
    webhook_auth_header_value="your_secret_here",
    speaker_labels=True,
    summarization=True,
    # ... other config
)

# Submit job and return immediately (non-blocking)
transcript = transcriber.submit(audio_url, config=config)
print(f"Job submitted: {transcript.id}")
# Your app can continue processing other requests

# Your webhook receives results when ready (typically 15-30% of audio duration)

Webhook handler example:

import requests as http_requests
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/webhooks/assemblyai", methods=["POST"])
def assemblyai_webhook():
    # Verify webhook authenticity
    if request.headers.get("X-Webhook-Secret") != "your_secret_here":
        return jsonify({"error": "Unauthorized"}), 401

    data = request.json
    transcript_id = data["transcript_id"]
    status = data["status"]

    if status == "completed":
        # Fetch the full transcript (webhook only sends transcript_id and status)
        transcript = http_requests.get(
            f"https://api.assemblyai.com/v2/transcript/{transcript_id}",
            headers={"authorization": "your_api_key"}
        ).json()
        process_completed_meeting(transcript)
    elif status == "error":
        log_transcription_error(transcript_id)

    return jsonify({"received": True}), 200

def process_completed_meeting(transcript):
    """Process completed meeting transcript"""
    utterances = transcript["utterances"]
    summary = transcript["summary"]

    # Store in database
    save_to_database(transcript)

    # Notify user
    send_notification(transcript["id"])

Pros:

Non-blocking - submit and forget
Scales to high volume
Process multiple files in parallel
Automatic retry on failures
Get notified when complete

Best for: Production apps, user-uploaded recordings, batch processing, SaaS products
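
The webhook handler above compares the secret header with `!=`. A common hardening step is a constant-time comparison, which avoids leaking how many leading characters matched. The following is a minimal sketch using the standard library's `hmac.compare_digest`; `is_valid_webhook_secret` is a hypothetical helper name.

```python
import hmac
from typing import Optional


def is_valid_webhook_secret(received: Optional[str], expected: str) -> bool:
    """Compare the webhook auth header to the expected secret in constant time."""
    if received is None:
        # Header missing entirely
        return False
    return hmac.compare_digest(received.encode("utf-8"), expected.encode("utf-8"))
```

In the handler, the check would become `if not is_valid_webhook_secret(request.headers.get("X-Webhook-Secret"), "your_secret_here")`, with the secret ideally loaded from configuration rather than hard-coded.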

Option 3: Polling (Custom Workflows)

# Submit job
transcript = transcriber.submit(audio_url, config=config)
print(f"Submitted: {transcript.id}")

# Poll for completion with progress tracking
while transcript.status not in [aai.TranscriptStatus.completed, aai.TranscriptStatus.error]:
    await asyncio.sleep(5)
    transcript = transcriber.get_transcript(transcript.id)

    # Optional: Show progress
    print(f"Status: {transcript.status}...")

if transcript.status == aai.TranscriptStatus.completed:
    process_transcript(transcript)
else:
    print(f"Error: {transcript.error}")

Pros:

Full control over retry logic
Can show progress to users
Good for background jobs
Works without webhook infrastructure

Cons:

Must implement your own polling loop
Ties up resources while polling
More complex than webhooks

Best for: Background job processors, CLIs with progress bars, custom retry logic

Comparison Table

Method	Blocking	Scalability	Complexity	Best For
Blocking	Yes	Low	Low	Prototypes, low volume
Webhooks	No	High	Medium	Production, high volume
Polling	Partial	Medium	Medium	Background jobs, progress UI

Scaling Considerations

HTTP rate limit: 20,000 requests per 5-minute window, counted across submissions (POST) and polling (GET) combined
Exceeding the limit: returns a 403 response
Parallel transcriptions (rate limit): 200+ for paid accounts (queued beyond that)
Ramp up gradually: start at 10-50 parallel requests, double incrementally
Avoid the rate limit: use webhooks or jittered, widened polling — see Polling without exceeding the rate limit
Contact Sales before large-scale rollouts
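
One way to picture "jittered, widened polling" is a delay schedule that grows over time and is randomized so that many jobs do not poll in lockstep. The following is a minimal sketch; the base interval, growth factor, cap, and jitter range are illustrative assumptions rather than AssemblyAI recommendations.

```python
import random
from typing import Iterator, Optional


def polling_delays(
    base_s: float = 3.0,
    factor: float = 1.5,
    max_s: float = 60.0,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield poll delays that widen up to max_s, each scaled by a random 0.5-1.0 jitter."""
    rng = rng or random.Random()
    interval = base_s
    while True:
        yield interval * rng.uniform(0.5, 1.0)
        # Widen the interval for the next poll, capped at max_s
        interval = min(interval * factor, max_s)
```

A polling loop would call `next()` on this generator before each status request and sleep for the returned number of seconds.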

How Do I Identify Speakers in My Recording?

Speaker diarization tells you when speakers change (“Speaker A”, “Speaker B”), but Speaker Identification tells you who they are by name or role.

Why Use Speaker Identification?

Instead of:

Speaker A: Let's review the Q3 numbers.
Speaker B: Revenue was up 15% this quarter.
Speaker A: Excellent work on that launch.

You get:

Sarah Chen: Let's review the Q3 numbers.
Michael Rodriguez: Revenue was up 15% this quarter.
Sarah Chen: Excellent work on that launch.

How It Works

Speaker Identification uses AssemblyAI’s Speech Understanding API to map generic speaker labels to actual names or roles that you provide:

import assemblyai as aai

aai.settings.api_key = "your_api_key"

# Step 1: Transcribe with speaker diarization
config = aai.TranscriptionConfig(
    speaker_labels=True,  # Must enable speaker diarization first
    speech_understanding={
        "request": {
            "speaker_identification": {
                "speaker_type": "name",  # or "role"
                "known_values": ["Sarah Chen", "Michael Rodriguez", "Alex Kim"]
            }
        }
    }
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("meeting_recording.mp3", config=config)

# Access results with identified speakers
for utterance in transcript.utterances:
    print(f"{utterance.speaker}: {utterance.text}")

Identifying by Role Instead of Name

For customer service, sales calls, or scenarios where you don’t know names:

config = aai.TranscriptionConfig(
    speaker_labels=True,
    speech_understanding={
        "request": {
            "speaker_identification": {
                "speaker_type": "role",
                "known_values": ["Agent", "Customer"]  # or ["Interviewer", "Interviewee"]
            }
        }
    }
)

Common role combinations:

["Agent", "Customer"] - Customer service calls
["Support", "Customer"] - Technical support
["Interviewer", "Interviewee"] - Interviews
["Host", "Guest"] - Podcasts
["Doctor", "Patient"] - Medical consultations (with HIPAA compliance)

How to Get Speaker Names

For platform recordings:

Zoom: Extract participant names from Zoom API or meeting JSON
Teams: Get attendees from Microsoft Graph API
Google Meet: Use Google Calendar API to get participants

Example with Zoom:

# Get participant names from Zoom meeting
zoom_participants = get_zoom_meeting_participants(meeting_id)
speaker_names = [p["name"] for p in zoom_participants]

# Use in speaker identification
config = aai.TranscriptionConfig(
    speaker_labels=True,
    speakers_expected=len(speaker_names),  # Exact number of speakers to detect
    speech_understanding={
        "request": {
            "speaker_identification": {
                "speaker_type": "name",
                "known_values": speaker_names
            }
        }
    }
)

Requirements and Accuracy Factors

Speaker Identification Requirements:

Speaker diarization must be enabled - Cannot identify speakers without diarization first
Requires sufficient audio per speaker - Each speaker needs enough speech for accurate matching
Works best with distinct voices - Similar voices may be confused
Post-processing step - Adds additional processing time after transcription

Accuracy depends on:

Audio quality (clear, minimal background noise)
Voice distinctiveness (different genders, accents, tones)
Amount of speech per speaker (more = better)
Number of speakers (fewer = more accurate)
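
Since accuracy depends on how much each person speaks, it can be useful to measure talk time per diarized speaker before relying on identification. The following is a minimal sketch, assuming utterances are plain dicts with `speaker`, `start`, and `end` (milliseconds) keys; the 10-second threshold is an illustrative assumption, not a documented requirement.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def talk_time_seconds(utterances: Iterable[Dict]) -> Dict[str, float]:
    """Sum utterance durations per speaker; start and end are in milliseconds."""
    totals: Dict[str, float] = defaultdict(float)
    for u in utterances:
        totals[u["speaker"]] += (u["end"] - u["start"]) / 1000
    return dict(totals)


def speakers_with_little_audio(utterances: Iterable[Dict], min_seconds: float = 10.0) -> List[str]:
    """Speakers whose total talk time falls below min_seconds."""
    return sorted(s for s, t in talk_time_seconds(utterances).items() if t < min_seconds)
```

Speakers returned by `speakers_with_little_audio` are candidates for manual confirmation rather than automatic naming.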

Alternative: Add Identification Later

You can add speaker identification to an existing transcript by posting to the Speech Understanding API with the transcript_id.

import requests

# First, transcribe with speaker diarization
transcript = transcriber.transcribe(audio_url, config=aai.TranscriptionConfig(speaker_labels=True))

# Later, add speaker identification using the transcript ID
understanding_body = {
    "transcript_id": transcript.id,
    "speech_understanding": {
        "request": {
            "speaker_identification": {
                "speaker_type": "name",
                "known_values": ["Sarah Chen", "Michael Rodriguez"]
            }
        }
    }
}

result = requests.post(
    "https://llm-gateway.assemblyai.com/v1/understanding",
    headers={"Authorization": aai.settings.api_key},
    json=understanding_body
).json()

# Access identified speakers from the response
for utterance in result["utterances"]:
    print(f"{utterance['speaker']}: {utterance['text']}")

This approach is useful when:

You get speaker names after the transcription completes
You want to try different name mappings
Building iterative workflows where users confirm speaker identities

For complete API details, see our Speaker Identification documentation.
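
For iterative workflows where users confirm or correct identities, the confirmed names can also be applied locally to stored utterances. The following is a minimal sketch of such a client-side relabel. It makes no API call, and `apply_speaker_names` is a hypothetical helper operating on utterance dicts with a `speaker` key.

```python
from typing import Dict, List


def apply_speaker_names(utterances: List[Dict], names: Dict[str, str]) -> List[Dict]:
    """Return copies of the utterances with speaker labels replaced where a name is known."""
    return [{**u, "speaker": names.get(u["speaker"], u["speaker"])} for u in utterances]
```

Labels without a confirmed name are left unchanged, so a partially confirmed mapping is safe to apply.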

How Do I Translate Between Languages in Meetings?

AssemblyAI supports translation between 99+ languages, enabling you to transcribe meetings in one language and translate to another.

When to Use Translation

Common use cases:

Transcribe Spanish meeting → Translate to English for documentation
Transcribe multilingual meeting → Translate all to common language
Create translated meeting notes for international teams
Provide translated summaries for stakeholders

Basic Translation

Translation is a Speech Understanding feature. You enable it via the speech_understanding parameter with target_languages:

import requests
import time

base_url = "https://api.assemblyai.com"
headers = {"authorization": "YOUR_API_KEY"}

# Configure transcription with translation
data = {
    "audio_url": "https://assembly.ai/wildfires.mp3",
    "speech_models": ["universal-3-5-pro", "universal-2"],
    "language_detection": True,
    "speaker_labels": True,
    "speech_understanding": {
        "request": {
            "translation": {
                "target_languages": ["es", "de"],
                "formal": True
            }
        }
    }
}

response = requests.post(base_url + "/v2/transcript", headers=headers, json=data)
transcript_id = response.json()["id"]
polling_endpoint = base_url + f"/v2/transcript/{transcript_id}"

while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()
    if transcript["status"] == "completed":
        break
    elif transcript["status"] == "error":
        raise RuntimeError(f"Transcription failed: {transcript['error']}")
    else:
        time.sleep(3)

print("--- Original Transcript ---")
print(transcript["text"][:200] + "...")

print("\n--- Translations ---")
for language_code, translated_text in transcript["translated_texts"].items():
    print(f"{language_code.upper()}:")
    print(translated_text[:200] + "...")

Translation with Speaker Labels

For meetings where you need per-utterance translations with speaker attribution:

data = {
    "audio_url": audio_url,
    "speech_models": ["universal-3-5-pro", "universal-2"],
    "speaker_labels": True,
    "speech_understanding": {
        "request": {
            "translation": {
                "target_languages": ["es"],
                "match_original_utterance": True,
                "formal": True
            }
        }
    }
}

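# After submitting this request and polling until completion (as in Basic Translation),
# read the per-utterance translations:
for utterance in transcript["utterances"]: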
    print(f"Speaker {utterance['speaker']}:")
    print(f"  Original: {utterance['text'][:100]}...")
    print(f"  Spanish: {utterance['translated_texts']['es'][:100]}...")

Supported Language Pairs

AssemblyAI supports translation between 99+ languages. Popular combinations:

Spanish ↔ English
French ↔ English
German ↔ English
Mandarin ↔ English
Japanese ↔ English
Portuguese ↔ English
And all combinations between supported languages

Translation Response Format

The response includes translated_texts as a dictionary keyed by language code:

{
    "text": "Original transcript in source language",
    "translated_texts": {
        "es": "Translated transcript in Spanish",
        "de": "Translated transcript in German"
    },
    "utterances": [
        {
            "speaker": "A",
            "text": "Hello, how are you?",
            "translated_texts": {
                "es": "Hola, ¿cómo estás?"
            },
            "start": 0,
            "end": 1500
        }
    ]
}

For complete language support and translation details, see our Translation documentation.
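
As a rough sketch of how the response format above might be consumed, the snippet below pairs each utterance with its translation to build bilingual meeting notes. `bilingual_lines` is a hypothetical helper, and the sample payload simply mirrors the example response (with placeholder strings for the full-text translations).

```python
from typing import List


def bilingual_lines(transcript: dict, language_code: str) -> List[str]:
    """Pair each utterance with its translation for one target language."""
    lines = []
    for utterance in transcript.get("utterances") or []:
        translated = (utterance.get("translated_texts") or {}).get(language_code)
        lines.append(f"Speaker {utterance['speaker']}: {utterance['text']}")
        # Skip the translated line when this language was not requested per utterance
        if translated is not None:
            lines.append(f"  [{language_code}] {translated}")
    return lines


# Sample payload mirroring the response format
sample = {
    "text": "Original transcript in source language",
    "translated_texts": {
        "es": "Translated transcript in Spanish",
        "de": "Translated transcript in German",
    },
    "utterances": [
        {
            "speaker": "A",
            "text": "Hello, how are you?",
            "translated_texts": {"es": "Hola, ¿cómo estás?"},
            "start": 0,
            "end": 1500,
        }
    ],
}

print("\n".join(bilingual_lines(sample, "es")))
```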

What Workflows Can I Build for My AI Meeting Notetaker?

Use these Speech Understanding and Guardrails features to transform raw transcripts into actionable insights.

Summarization

summarization: true
What it does: Generates an abstractive recap of the conversation (not verbatim).
Output: summary string (bullets/paragraph format).
Great for: Meeting notes, call recaps, executive summaries.
Notes: Condenses and rephrases; minor details may be omitted by design.

Example:

config = aai.TranscriptionConfig(
    summarization=True,
    summary_type="bullets",  # or "bullets_verbose", "gist", "headline", "paragraph"
    summary_model="informative",  # or "conversational"
)

Sentiment Analysis

sentiment_analysis: true
What it does: Scores per-utterance sentiment (positive / neutral / negative).
Output: Array of { text, sentiment, confidence, start, end }.
Great for: Customer satisfaction tracking, coaching, churn prediction.
Notes: Segment-level (not global mood); sarcasm and very short utterances are harder to classify.

Example:

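# In the Python SDK the results are exposed as transcript.sentiment_analysis
# (the JSON response key is sentiment_analysis_results)
for result in transcript.sentiment_analysis:
    if result.sentiment == "NEGATIVE":
        print(f"Negative sentiment detected: {result.text}")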

Entity Detection

entity_detection: true
What it does: Extracts named entities (people, organizations, locations, products, etc.).
Output: Array of { entity_type, text, start, end }.
Great for: Auto-tagging topics, tracking competitors mentioned, CRM enrichment.
Notes: Operates on post-redaction text if PII redaction is enabled.

Example:

# Extract all organizations mentioned
organizations = [
    entity.text for entity in transcript.entities
    if entity.entity_type == "organization"
]
print(f"Companies mentioned: {', '.join(organizations)}")

Redact PII Text

redact_pii: true
What it does: Scans transcript for personally identifiable information and replaces matches per policy.
Output: text with replacements; the original word timings are preserved.
Great for: GDPR/CCPA compliance, safe sharing, SOC2 requirements.
Notes: Runs before downstream features; they see the redacted text.

Recommended policies for meetings:

config = aai.TranscriptionConfig(
    redact_pii=True,
    redact_pii_policies=[
        aai.PIIRedactionPolicy.person_name,      # Remove names
        aai.PIIRedactionPolicy.email_address,    # Remove emails
        aai.PIIRedactionPolicy.phone_number,     # Remove phone numbers
        aai.PIIRedactionPolicy.organization,     # Remove company names
    ],
    redact_pii_sub=aai.PIISubstitutionPolicy.hash,  # Stable hash tokens
)

Why hash substitution?

Stable across the file (same value → same token)
Maintains sentence structure for LLM processing
Prevents reconstruction of original data
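
To make the first property concrete (same value, same token), here is a generic illustration using Python's `hashlib`. It is not AssemblyAI's redaction algorithm: `stable_token`, `redact`, and the `#`-prefixed token format are all made up for this sketch, and the list of sensitive values is supplied by hand rather than detected.

```python
import hashlib
from typing import List


def stable_token(value: str, length: int = 8) -> str:
    """Map a value to a short, deterministic token (illustration only)."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return f"#{digest[:length]}"


def redact(text: str, sensitive_values: List[str]) -> str:
    """Replace each known sensitive value with its deterministic token."""
    for value in sensitive_values:
        text = text.replace(value, stable_token(value))
    return text


line_1 = redact("Sarah Chen opened the meeting.", ["Sarah Chen"])
line_2 = redact("Action item assigned to Sarah Chen.", ["Sarah Chen"])

# Both lines carry the same token, so a downstream LLM can still tell
# that the two sentences refer to the same (hidden) person.
print(line_1)
print(line_2)
```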

Redact PII Audio

redact_pii_audio: true
What it does: Produces a second audio file where redacted portions are bleeped/silenced.
Output: redacted_audio_url in the transcript response.
Great for: External sharing, training materials, demos.
Notes: Original audio is untouched; bleeped sections may sound choppy.

Complete Example

config = aai.TranscriptionConfig(
    # Core transcription
    speaker_labels=True,

    # Speech Understanding
    summarization=True,
    sentiment_analysis=True,
    entity_detection=True,

    # PII protection
    redact_pii=True,
    redact_pii_policies=[
        aai.PIIRedactionPolicy.person_name,
        aai.PIIRedactionPolicy.email_address,
        aai.PIIRedactionPolicy.phone_number,
    ],
    redact_pii_sub=aai.PIISubstitutionPolicy.hash,
    redact_pii_audio=True,
)

transcript = transcriber.transcribe(audio_url, config=config)

# Access all features
# (analyze_sentiment_trend and extract_entities are your own helper functions)
meeting_insights = {
    "summary": transcript.summary,
    "sentiment_trend": analyze_sentiment_trend(transcript.sentiment_analysis),
    "entities": extract_entities(transcript.entities),
    "safe_transcript": transcript.text,  # PII redacted
    "safe_audio": transcript.redacted_audio_url,  # PII bleeped
}
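
The example above calls two helpers, `analyze_sentiment_trend` and `extract_entities`, that are not defined in this guide. One possible way to write them is sketched below. The bucketing strategy and return shapes are assumptions chosen for illustration, and the stand-in objects only imitate the `sentiment`, `start`, `entity_type`, and `text` attributes of real results.

```python
from collections import Counter, defaultdict
from types import SimpleNamespace


def analyze_sentiment_trend(sentiment_results, buckets: int = 3):
    """Count sentiment labels in equal time slices of the meeting.

    Each result needs `.sentiment` and `.start` (milliseconds) attributes.
    """
    results = list(sentiment_results or [])
    if not results:
        return []
    last_start = max(r.start for r in results)
    slice_ms = last_start / buckets if last_start else 1
    trend = [Counter() for _ in range(buckets)]
    for r in results:
        index = min(int(r.start / slice_ms), buckets - 1)
        # Accept either plain strings or enum members with a .value
        label = str(getattr(r.sentiment, "value", r.sentiment))
        trend[index][label] += 1
    return [dict(counter) for counter in trend]


def extract_entities(entities):
    """Group unique entity texts by entity type, preserving first-seen order."""
    grouped = defaultdict(list)
    for entity in entities or []:
        entity_type = str(getattr(entity.entity_type, "value", entity.entity_type))
        if entity.text not in grouped[entity_type]:
            grouped[entity_type].append(entity.text)
    return dict(grouped)


# Stand-in objects for demonstration; real results come from the transcript
sentiments = [
    SimpleNamespace(sentiment="POSITIVE", start=0),
    SimpleNamespace(sentiment="NEUTRAL", start=30_000),
    SimpleNamespace(sentiment="NEGATIVE", start=60_000),
]
entities = [
    SimpleNamespace(entity_type="organization", text="AssemblyAI"),
    SimpleNamespace(entity_type="organization", text="AssemblyAI"),
    SimpleNamespace(entity_type="person_name", text="Sarah Chen"),
]

print(analyze_sentiment_trend(sentiments))
print(extract_entities(entities))
```

Splitting the meeting into a few time slices makes it easy to spot a conversation that starts positive and ends negative, which a single overall count would hide.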

How Do I Improve the Accuracy of My Notetaker?

Provide meeting context through the keyterms prompt. Best practices:

Include participant names for better speaker recognition
Add company-specific jargon and acronyms
Include product names and technical terms
Keep individual terms under 50 characters
Up to 200 terms per request (Universal-2) or 1000 terms (Universal-3.5 Pro)
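
A small pre-flight check can enforce these limits before a request is sent. The sketch below is one possible approach, not an SDK feature: `prepare_keyterms` is a hypothetical helper, the limits are taken from the list above, and case-insensitive de-duplication is an assumption about what you want.

```python
from typing import Iterable, List

MAX_TERM_LENGTH = 50  # terms should stay under 50 characters
MAX_TERMS = {"universal-2": 200, "universal-3-5-pro": 1000}


def prepare_keyterms(terms: Iterable[str], model: str = "universal-3-5-pro") -> List[str]:
    """Strip, de-duplicate (case-insensitively), and cap a keyterm list."""
    limit = MAX_TERMS[model]
    seen = set()
    cleaned = []
    for term in terms:
        term = term.strip()
        key = term.lower()
        # Drop empty terms, over-long terms, and repeats
        if not term or len(term) >= MAX_TERM_LENGTH or key in seen:
            continue
        seen.add(key)
        cleaned.append(term)
        if len(cleaned) == limit:
            break
    return cleaned


raw_terms = ["AssemblyAI", " assemblyai ", "Sarah Chen", "", "x" * 60, "TTFT"]
print(prepare_keyterms(raw_terms))
```

Put the most important terms first: when the list exceeds the model's limit, this helper keeps the earliest entries and drops the rest.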

Using Keyterms Prompt for Pre-recorded Transcription

Keyterms prompting improves recognition accuracy for domain-specific vocabulary by up to 21%:

# Define domain-specific vocabulary
company_terms = [
    "AssemblyAI",
    "Universal-3.5 Pro",
    "Speech Understanding",
    "diarization"
]

participant_names = [
    "Dylan Fox",
    "Sarah Chen",
    "Michael Rodriguez"
]

technical_terms = [
    "API endpoint",
    "WebSocket",
    "latency metrics",
    "TTFT"
]

# Configure with keyterms prompt
config = aai.TranscriptionConfig(
    keyterms_prompt=company_terms + participant_names + technical_terms,
    speaker_labels=True,
    # ... other settings
)

Using Keyterms Prompt for Streaming

# Streaming with contextual keyterms
keyterms = [
    # Participant names
    "Alice Johnson",
    "Bob Smith",

    # Meeting-specific vocabulary
    "Q4 objectives",
    "revenue targets",
    "customer acquisition",

    # Technical terms
    "API integration",
    "cloud migration"
]

CONNECTION_PARAMS = {
    "sample_rate": 16000,
    "speech_model": "universal-3-5-pro",
    "format_turns": True,
    "keyterms_prompt": keyterms,
}
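
These parameters are passed as a query string on the WebSocket URL. The following is a sketch under two assumptions you should confirm against the Streaming documentation: the base URL shown here, and that list-valued parameters such as `keyterms_prompt` are sent as a JSON-encoded array. `build_streaming_url` is a hypothetical helper.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint; verify against the Streaming documentation
STREAMING_BASE_URL = "wss://streaming.assemblyai.com/v3/ws"


def build_streaming_url(params: dict) -> str:
    """Serialize connection parameters into a WebSocket URL."""
    encoded = {
        # Assumption: lists are sent as a JSON array string
        key: json.dumps(list(value)) if isinstance(value, (list, tuple)) else value
        for key, value in params.items()
    }
    return f"{STREAMING_BASE_URL}?{urlencode(encoded)}"


connection_params = {
    "sample_rate": 16000,
    "speech_model": "universal-3-5-pro",
    "format_turns": True,
    "keyterms_prompt": ["Alice Johnson", "Q4 objectives"],
}
api_endpoint = build_streaming_url(connection_params)

# Round-trip check: decode the query string and recover the keyterms
query = parse_qs(urlsplit(api_endpoint).query)
print(json.loads(query["keyterms_prompt"][0]))
```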

How Do I Process the Response from the API?

Processing Pre-recorded Responses

def process_transcript(transcript):
    """
    Extract and process all relevant data from pre-recorded transcript
    """
    # Basic transcript data
    meeting_data = {
        "id": transcript.id,
        "duration": transcript.audio_duration,
        "confidence": transcript.confidence,
        "full_text": transcript.text
    }

    # Process speaker utterances
    speakers = {}
    # utterances may be None when speaker diarization was not enabled
    for utterance in transcript.utterances or []:
        speaker = utterance.speaker

        if speaker not in speakers:
            speakers[speaker] = {
                "utterances": [],
                "total_speaking_time": 0,
                "word_count": 0
            }

        speakers[speaker]["utterances"].append({
            "text": utterance.text,
            "start": utterance.start,
            "end": utterance.end,
            "confidence": utterance.confidence
        })

        # Calculate speaking time
        speakers[speaker]["total_speaking_time"] += (utterance.end - utterance.start) / 1000
        speakers[speaker]["word_count"] += len(utterance.text.split())

    meeting_data["speakers"] = speakers

    # Extract summary
    if transcript.summary:
        meeting_data["summary"] = transcript.summary

    # Calculate meeting statistics
    total_duration = transcript.audio_duration  # Already in seconds
    meeting_data["statistics"] = {
        "total_speakers": len(speakers),
        "total_words": sum(s["word_count"] for s in speakers.values()),
        "average_confidence": transcript.confidence,
        "speaking_distribution": {
            speaker: {
                # Guard against a missing or zero audio duration
                "percentage": (data["total_speaking_time"] / total_duration) * 100 if total_duration else 0.0,
                "minutes": data["total_speaking_time"] / 60
            }
            for speaker, data in speakers.items()
        }
    }

    return meeting_data

# Example usage
result = process_transcript(transcript)
print(f"Meeting had {result['statistics']['total_speakers']} speakers")
print(f"Speaker A spoke for {result['statistics']['speaking_distribution']['A']['minutes']:.1f} minutes")
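
Grouping utterances by speaker is convenient for statistics, but readers usually want the notes in chronological order. As a sketch, the hypothetical helpers below merge the `speakers` dictionary built above back into a timestamped transcript. They assume `start` values are in milliseconds, as in the speaking-time calculation.

```python
def format_timestamp(milliseconds: int) -> str:
    """Convert milliseconds to MM:SS (or H:MM:SS for meetings over an hour)."""
    total_seconds = int(milliseconds // 1000)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{seconds:02d}"
    return f"{minutes:02d}:{seconds:02d}"


def render_notes(speakers: dict) -> str:
    """Merge per-speaker utterances into one chronological transcript."""
    rows = [
        (utterance["start"], speaker, utterance["text"])
        for speaker, data in speakers.items()
        for utterance in data["utterances"]
    ]
    rows.sort(key=lambda row: row[0])  # order by start time
    return "\n".join(
        f"[{format_timestamp(start)}] Speaker {speaker}: {text}"
        for start, speaker, text in rows
    )


# Sample data shaped like meeting_data["speakers"]
speakers = {
    "A": {"utterances": [
        {"text": "Welcome everyone.", "start": 0, "end": 1500},
        {"text": "Let's begin.", "start": 65000, "end": 66000},
    ]},
    "B": {"utterances": [
        {"text": "Thanks.", "start": 2000, "end": 2500},
    ]},
}

print(render_notes(speakers))
```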

Processing Streaming Responses

import json
from datetime import datetime

import websockets

class StreamingResponseProcessor:
    def __init__(self):
        self.partial_buffer = ""
        self.final_transcripts = []
        self.turn_metadata = []

    def process_message(self, message: dict):
        """
        Process real-time streaming messages
        """
        msg_type = message.get("type")

        if msg_type == "Begin":
            return {
                "event": "session_started",
                "session_id": message.get("id"),
                "expires_at": message.get("expires_at")
            }

        elif msg_type == "Turn":
            return self.process_turn(message)

        elif msg_type == "Termination":
            return {
                "event": "session_ended",
                "audio_duration": message.get("audio_duration_seconds"),
                "session_duration": message.get("session_duration_seconds")
            }

    def process_turn(self, data: dict):
        """Process turn messages"""
        is_final = data.get("end_of_turn")
        transcript = data.get("transcript", "")
        turn_order = data.get("turn_order")

        response = {
            "turn_order": turn_order,
            "is_final": is_final,
            "confidence": data.get("end_of_turn_confidence", 0)
        }

        # Handle partials (for live display)
        if not is_final and transcript:
            self.partial_buffer = transcript
            response["event"] = "partial"
            response["text"] = transcript

        # Handle finals (for storage)
        elif is_final:
            final_transcript = {
                "turn_order": turn_order,
                "text": transcript,
                "confidence": data.get("end_of_turn_confidence"),
                "timestamp": datetime.now().isoformat()
            }
            self.final_transcripts.append(final_transcript)
            response["event"] = "final"
            response["text"] = transcript

            # Clear partial buffer
            self.partial_buffer = ""

        return response

    def get_full_transcript(self):
        """
        Combine all final transcripts into complete meeting transcript
        """
        return {
            "full_text": " ".join(t["text"] for t in self.final_transcripts),
            "transcripts": self.final_transcripts,
            "total_turns": len(self.final_transcripts)
        }

# Example usage
import asyncio

processor = StreamingResponseProcessor()

async def run_session():
    # If you're using `websockets` version 13.0 or later, use the `additional_headers` parameter.
    # For older versions (< 13.0), use `extra_headers` instead.
    async with websockets.connect(API_ENDPOINT, additional_headers=headers) as ws:
        async for message in ws:
            data = json.loads(message)
            result = processor.process_message(data)

            # Unknown message types return None, and empty partial turns carry no "event" key
            event = result.get("event") if result else None

            if event == "partial":
                # Update UI with live transcript
                update_live_caption(result["text"])

            elif event == "final":
                # Save final transcript
                save_transcript_segment(result)

asyncio.run(run_session())

# Get complete transcript when done
full_transcript = processor.get_full_transcript()
