
Best Practices for Building Meeting Notetakers

Introduction

Building a robust meeting notetaker requires careful consideration of accuracy, latency, speaker identification, and real-time capabilities. This guide addresses common questions and provides practical solutions for both post-call and live meeting transcription scenarios.

Why AssemblyAI for Meeting Notetakers?

AssemblyAI stands out as the premier choice for meeting notetakers with several key advantages:

Industry-Leading Accuracy with Pre-recorded Audio

  • 93.3%+ transcription accuracy ensures reliable meeting documentation
  • 2.9% speaker diarization error rate for precise “who said what” attribution
  • Speech Understanding integration for intelligent post-processing and insights
  • Keyterms prompt allows providing meeting context to improve accuracy of transcription

Streaming with Universal-3 Pro

As meeting notetakers evolve toward real-time capabilities, AssemblyAI’s Universal-3 Pro Streaming model (u3-rt-pro) offers significant benefits:

  • Speaker diarization available for both pre-recorded and streaming transcription
  • Ultra-low latency (~300ms) enables live transcription without delays
  • Format turns feature provides structured, readable output in real-time
  • Keyterms prompt allows providing meeting context to improve accuracy of transcription

End-to-End Voice AI Platform

Unlike fragmented solutions, AssemblyAI provides a unified API for:

  • Transcription with speaker diarization
  • Automatic language detection and code switching
  • Boosting accuracy via meeting context with keyterms prompt
  • Speech Understanding tasks like speaker identification, translation, and transcript styling
  • Post-processing workflows with custom prompting - from summarization to completely custom workflows
  • Real-time and batch processing of pre-recorded audio in a single platform

When Should I Use Pre-recorded vs Streaming for Meeting Notetakers?

Understanding when to use pre-recorded versus streaming speech-to-text is critical for building the right meeting notetaker.

Pre-recorded Speech-to-text

Post-call analysis - Meeting already happened, you have the full recording

  • Highest accuracy needed - Pre-recorded models have higher accuracy (93.3%+)
  • Speaker diarization is critical - Pre-recorded has 2.9% speaker error rate
  • Broad language support - Need any of 99+ languages
  • Advanced features required - Summarization, sentiment analysis, entity detection, PII redaction, speaker identification
  • Batch processing - Processing multiple recordings at once
  • Quality over speed - Can wait seconds/minutes for perfect results

Best for: Zoom/Teams/Meet recording uploads, compliance, documentation, post-call summaries, searchable archives

Streaming Speech-to-text

Live meetings - Transcribing as the meeting happens

You should use streaming when you need to display a live transcript of text to users as they are speaking. With Universal-3 Pro Streaming, accuracy is closer to pre-recorded, but pre-recorded will always be the most accurate option.

  • Real-time captions - Displaying subtitles/captions to participants during calls
  • Immediate feedback - Need transcription within ~300ms
  • Interactive features - Live note-taking, real-time keyword detection, action item alerts
  • No recording available - Processing live audio only

Best for: Live captions, real-time note-taking apps, accessibility features, live keyword alerts

Many successful meeting notetakers use both pre-recorded and streaming speech-to-text:

  1. Streaming during the call - Provide live captions and real-time notes to participants
  2. Pre-recorded after the call - Generate high-quality transcript with speaker labels, summary, and insights

This gives users immediate value during meetings while providing comprehensive documentation afterward.

Example workflow:

  • User joins meeting → Start streaming for live captions
  • Meeting ends → Upload recording to pre-recorded API for final transcript with speaker names
  • Generate meeting summary, action items, and searchable archive from pre-recorded transcript

What Languages and Features Are Available for a Meeting Notetaker?

Pre-Recorded Meetings

For post-call analysis, AssemblyAI supports:

Languages:

  • 99 languages supported
  • Automatic Language Detection to identify the dominant language of the meeting
  • Code Switching to preserve mid-meeting switches between languages

Core Features:

  • Speaker diarization (1-10 speakers by default, expandable to any min/max)
  • Multichannel audio support (each channel = one speaker)
  • Automatic formatting, punctuation, and capitalization
  • Keyterms prompting for boosting domain-specific terms

Speech Understanding Models:

  • Summarization for meeting recaps
  • Sentiment analysis for meeting tone assessment
  • Entity detection for extracting key information
  • Speaker identification to map generic labels to actual names/roles
  • Translation between 99+ languages

Real-Time Streaming

For live meeting transcription:

Languages:

  • English-only model (default)
  • Multilingual model supporting English, Spanish, French, German, Portuguese, and Italian

Core Features (Universal-3 Pro Streaming)

  • Speaker diarization for identifying who is speaking
  • Partial and final transcripts for responsive UI
  • Format turns for structured, readable output
  • Keyterms prompt for contextual accuracy

See the Universal-3 Pro Streaming documentation for full details.

How Can I Get Started Building a Post-Call Meeting Notetaker?

Here’s a complete example implementing pre-recorded transcription with all essential features:

```python
import assemblyai as aai
import asyncio
from typing import Dict, List
from assemblyai.types import (
    SpeakerOptions,
    LanguageDetectionOptions,
    PIIRedactionPolicy,
    PIISubstitutionPolicy,
)

# Configure API key
aai.settings.api_key = "your_api_key_here"

async def transcribe_meeting_async(audio_source: str) -> Dict:
    """
    Asynchronously transcribe a meeting recording with full features

    Args:
        audio_source: Either a local file path or publicly accessible URL
    """
    # Configure comprehensive meeting analysis
    config = aai.TranscriptionConfig(
        # Speaker diarization
        speaker_labels=True,
        speakers_expected=None,  # Use if you know exact number from Zoom/Meet/Teams
        speaker_options=SpeakerOptions(
            min_speakers_expected=2,
            max_speakers_expected=10  # Keeping max high is safe and won't hurt accuracy
        ),
        multichannel=False,  # Set to True if audio has separate channel per speaker

        # Language detection
        language_detection=True,  # Auto-detect the most used language
        language_detection_options=LanguageDetectionOptions(
            code_switching=True,  # Preserve language switches
            code_switching_confidence_threshold=0.5,
        ),

        # Punctuation and formatting
        punctuate=True,
        format_text=True,

        # Boost accuracy of meeting-specific vocabulary
        keyterms_prompt=["quarterly", "KPI", "roadmap", "deliverables"],

        # Speech Understanding - commonly used models
        summarization=True,
        sentiment_analysis=True,
        entity_detection=True,
        redact_pii=True,
        redact_pii_policies=[
            PIIRedactionPolicy.person_name,
            PIIRedactionPolicy.organization,
            PIIRedactionPolicy.occupation,
        ],
        redact_pii_sub=PIISubstitutionPolicy.hash,
        redact_pii_audio=True
    )

    # Create transcriber
    transcriber = aai.Transcriber()

    try:
        # Submit transcription job
        transcript = await asyncio.to_thread(
            transcriber.transcribe,
            audio_source,
            config=config
        )

        # Check status
        if transcript.status == aai.TranscriptStatus.error:
            raise Exception(f"Transcription failed: {transcript.error}")

        # Process speaker-labeled utterances
        print("\n=== SPEAKER-LABELED TRANSCRIPT ===\n")

        for utterance in transcript.utterances:
            # Format timestamp
            start_time = utterance.start / 1000  # Convert to seconds
            end_time = utterance.end / 1000

            # Print formatted utterance
            print(f"[{start_time:.1f}s - {end_time:.1f}s] Speaker {utterance.speaker}:")
            print(f"  {utterance.text}")
            print(f"  Confidence: {utterance.confidence:.2%}\n")

        # Print summary data
        print("\n=== MEETING SUMMARY ===\n")
        print({
            "id": transcript.id,
            "status": transcript.status,
            "duration": transcript.audio_duration,
            "speaker_count": len(set(u.speaker for u in transcript.utterances)),
            "word_count": len(transcript.words) if transcript.words else 0,
            "detected_language": transcript.language_code if hasattr(transcript, 'language_code') else None,
            "summary": transcript.summary,
        })

        return {
            "transcript": transcript,
            "utterances": transcript.utterances,
            "summary": transcript.summary,
        }

    except Exception as e:
        print(f"Error during transcription: {e}")
        raise

async def main():
    """
    Example usage with error handling
    """
    # Use either local file OR URL (not both)
    audio_source = "https://assembly.ai/wildfires.mp3"  # Or "path/to/recording.mp3"

    try:
        result = await transcribe_meeting_async(audio_source)

        # Additional processing
        print(f"\nTotal speakers identified: {len(set(u.speaker for u in result['utterances']))}")
        print(f"Meeting duration: {result['transcript'].audio_duration} seconds")

    except Exception as e:
        print(f"Failed to process meeting: {e}")

if __name__ == "__main__":
    asyncio.run(main())
```

How Can I Get Started Building a During-Call Live Meeting Notetaker?

Here’s a complete example for real-time streaming transcription with meeting-optimized settings:

```python
# pip install pyaudio websocket-client
import pyaudio
import websocket
import json
import threading
import time
from urllib.parse import urlencode
from datetime import datetime

# --- Configuration ---
YOUR_API_KEY = "your_api_key"

# Keyterms to improve recognition accuracy
KEYTERMS = [
    "Alice Johnson",
    "Bob Smith",
    "Carol Davis",
    "quarterly review",
    "action items",
    "follow up",
    "deadline",
    "budget"
]

# MEETING NOTETAKER CONFIGURATION (different from voice agents!)
CONNECTION_PARAMS = {
    "sample_rate": 16000,
    "speech_model": "u3-rt-pro",
    "format_turns": True,  # ALWAYS TRUE for meetings - users need readable text

    # Meeting-optimized turn detection (wait longer than voice agents)
    # u3-rt-pro defaults: min_turn_silence=100ms, max_turn_silence=1000ms
    "min_turn_silence": 560,   # Wait longer for natural pauses (voice agents use ~100ms)
    "max_turn_silence": 2000,  # Allow thinking pauses

    # Keyterms for accuracy - pass each term as a separate query parameter
    "keyterms_prompt": KEYTERMS,
}

API_ENDPOINT_BASE_URL = "wss://streaming.assemblyai.com/v3/ws"
API_ENDPOINT = f"{API_ENDPOINT_BASE_URL}?{urlencode(CONNECTION_PARAMS, doseq=True)}"

# Audio Configuration
FRAMES_PER_BUFFER = 800  # 50ms of audio
SAMPLE_RATE = CONNECTION_PARAMS["sample_rate"]
CHANNELS = 1
FORMAT = pyaudio.paInt16

# Global variables
audio = None
stream = None
ws_app = None
audio_thread = None
stop_event = threading.Event()
transcript_buffer = []


def on_open(ws):
    """Called when the WebSocket connection is established."""
    print("=" * 80)
    print(f"[{datetime.now().strftime('%H:%M:%S')}] Meeting transcription started")
    print(f"Connected to: {API_ENDPOINT_BASE_URL}")
    print(f"Keyterms configured: {', '.join(KEYTERMS)}")
    print("=" * 80)
    print("\nSpeak into your microphone. Press Ctrl+C to stop.\n")

    def stream_audio():
        """Stream audio from microphone to WebSocket"""
        global stream
        while not stop_event.is_set():
            try:
                audio_data = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)
                ws.send(audio_data, websocket.ABNF.OPCODE_BINARY)
            except Exception as e:
                if not stop_event.is_set():
                    print(f"Error streaming audio: {e}")
                break

    global audio_thread
    audio_thread = threading.Thread(target=stream_audio)
    audio_thread.daemon = True
    audio_thread.start()


def on_message(ws, message):
    """Handle incoming messages from AssemblyAI"""
    try:
        data = json.loads(message)
        msg_type = data.get("type")

        # Uncomment to see full JSON for debugging:
        # print("=" * 80)
        # print(json.dumps(data, indent=2, ensure_ascii=False))
        # print("=" * 80)
        # print()

        if msg_type == "Begin":
            session_id = data.get("id", "N/A")
            print(f"[SESSION] Started - ID: {session_id}\n")

        elif msg_type == "Turn":
            end_of_turn = data.get("end_of_turn", False)
            turn_is_formatted = data.get("turn_is_formatted", False)
            transcript = data.get("transcript", "")
            turn_order = data.get("turn_order", 0)
            end_of_turn_confidence = data.get("end_of_turn_confidence", 0.0)

            # FOR MEETING NOTETAKERS: Show partials for responsive UI
            if not end_of_turn and transcript:
                print(f"\r[LIVE] {transcript}", end="", flush=True)

            # FOR MEETING NOTETAKERS: Use formatted finals for readable display
            # (Unlike voice agents which should use utterance for speed)
            if end_of_turn and turn_is_formatted and transcript:
                timestamp = datetime.now().strftime('%H:%M:%S')
                print(f"\n[{timestamp}] {transcript}")
                print(f"  Turn: {turn_order} | Confidence: {end_of_turn_confidence:.2%}")

                # Detect action items
                transcript_lower = transcript.lower()
                if any(term in transcript_lower for term in ["action item", "follow up", "deadline", "assigned to", "todo"]):
                    print("  ⚠️ ACTION ITEM DETECTED!")

                # Store final transcript
                transcript_buffer.append({
                    "timestamp": timestamp,
                    "text": transcript,
                    "turn_order": turn_order,
                    "confidence": end_of_turn_confidence,
                    "type": "final"
                })
                print()

        elif msg_type == "Termination":
            audio_duration = data.get("audio_duration_seconds", 0)
            print(f"\n[SESSION] Terminated - Duration: {audio_duration}s")
            save_transcript()

        elif msg_type == "Error":
            error_msg = data.get("error", "Unknown error")
            print(f"\n[ERROR] {error_msg}")

    except json.JSONDecodeError as e:
        print(f"Error decoding message: {e}")
    except Exception as e:
        print(f"Error handling message: {e}")


def on_error(ws, error):
    """Called when a WebSocket error occurs."""
    print(f"\n[WEBSOCKET ERROR] {error}")
    stop_event.set()


def on_close(ws, close_status_code, close_msg):
    """Called when the WebSocket connection is closed."""
    print(f"\n[WEBSOCKET] Disconnected - Status: {close_status_code}, Message: {close_msg}")

    global stream, audio
    stop_event.set()

    # Clean up audio stream
    if stream:
        if stream.is_active():
            stream.stop_stream()
        stream.close()
        stream = None
    if audio:
        audio.terminate()
        audio = None
    if audio_thread and audio_thread.is_alive():
        audio_thread.join(timeout=1.0)


def save_transcript():
    """Save the transcript to a file"""
    if not transcript_buffer:
        print("No transcript to save.")
        return

    filename = f"meeting_transcript_{datetime.now().strftime('%Y%m%d_%H%M%S')}.txt"

    with open(filename, "w", encoding="utf-8") as f:
        f.write("Meeting Transcript\n")
        f.write(f"Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
        f.write(f"Keyterms: {', '.join(KEYTERMS)}\n")
        f.write("=" * 80 + "\n\n")

        for entry in transcript_buffer:
            f.write(f"[{entry['timestamp']}] {entry['text']}\n")
            f.write(f"Confidence: {entry['confidence']:.2%}\n\n")

    print(f"Transcript saved to: {filename}")


def run():
    """Main function to run the streaming transcription"""
    global audio, stream, ws_app

    # Initialize PyAudio
    audio = pyaudio.PyAudio()

    # Open microphone stream
    try:
        stream = audio.open(
            input=True,
            frames_per_buffer=FRAMES_PER_BUFFER,
            channels=CHANNELS,
            format=FORMAT,
            rate=SAMPLE_RATE,
        )
        print("Microphone stream opened successfully.")
    except Exception as e:
        print(f"Error opening microphone stream: {e}")
        if audio:
            audio.terminate()
        return

    # Create WebSocketApp
    ws_app = websocket.WebSocketApp(
        API_ENDPOINT,
        header={"Authorization": YOUR_API_KEY},
        on_open=on_open,
        on_message=on_message,
        on_error=on_error,
        on_close=on_close,
    )

    # Run WebSocketApp in a separate thread
    ws_thread = threading.Thread(target=ws_app.run_forever)
    ws_thread.daemon = True
    ws_thread.start()

    try:
        # Keep main thread alive until interrupted
        while ws_thread.is_alive():
            time.sleep(0.1)
    except KeyboardInterrupt:
        print("\n\nCtrl+C received. Stopping transcription...")
        stop_event.set()

        # Send termination message to the server
        if ws_app and ws_app.sock and ws_app.sock.connected:
            try:
                terminate_message = {"type": "Terminate"}
                ws_app.send(json.dumps(terminate_message))
                time.sleep(1)
            except Exception as e:
                print(f"Error sending termination message: {e}")

        if ws_app:
            ws_app.close()

        ws_thread.join(timeout=2.0)

    finally:
        # Final cleanup
        if stream and stream.is_active():
            stream.stop_stream()
        if stream:
            stream.close()
        if audio:
            audio.terminate()
        print("Cleanup complete. Exiting.")


if __name__ == "__main__":
    run()
```

These settings wait longer before ending turns to accommodate natural conversation pauses and ensure readable formatted text for display. You can tweak these settings to get the best results for your notetaker.
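As a rough side-by-side, here are the turn-detection parameters above next to the lower-latency values a voice agent might use. The voice-agent numbers are simply the u3-rt-pro defaults mentioned in the code comments, and disabling `format_turns` for agents is an assumption based on the "use utterances for speed" note; treat both as starting points, not prescriptions:

```python
# Meeting notetaker: favor complete, readable turns
NOTETAKER_PARAMS = {
    "format_turns": True,      # readable text for display
    "min_turn_silence": 560,   # tolerate natural pauses
    "max_turn_silence": 2000,  # allow thinking pauses
}

# Voice agent: favor fast end-of-turn detection (u3-rt-pro defaults)
VOICE_AGENT_PARAMS = {
    "format_turns": False,     # raw finals arrive sooner
    "min_turn_silence": 100,
    "max_turn_silence": 1000,
}
```

The trade-off is responsiveness versus fragmentation: shorter silence thresholds end turns faster but can split one sentence across several turns.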

How Do I Handle Multichannel Meeting Audio?

Many meeting platforms (Zoom, Teams, Google Meet) can record each participant on separate audio channels. This dramatically improves speaker identification accuracy.

For Pre-recorded Meetings

```python
config = aai.TranscriptionConfig(
    multichannel=True,     # Enable when each speaker is on a different channel
    speaker_labels=False,  # Disable - channels already separate speakers
    # Other settings...
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(audio_file, config=config)

# Access per-channel transcripts
for channel, channel_transcript in enumerate(transcript.channels):
    print(f"\n=== Channel {channel} ===")
    print(channel_transcript.text)
```

When to use multichannel:

  • Zoom local recordings with “Record separate audio file for each participant” enabled
  • Professional podcast recordings with individual microphones
  • Conference systems with dedicated channels per participant
  • Phone calls with caller and callee on separate channels

Benefits:

  • Perfect speaker separation - No diarization errors
  • No speaker confusion or overlap issues
  • Faster processing time - Diarization not needed
  • Higher accuracy - Model processes clean single-speaker audio

How to enable in meeting platforms:

  • Zoom: Settings → Recording → Advanced → “Record a separate audio file for each participant”
  • Teams: Requires third-party recording solutions like Recall.ai
  • Google Meet: Requires third-party recording solutions like Recall.ai

For Streaming Meetings

For real-time multichannel audio, create separate streaming sessions per channel:

```python
import asyncio
import json
import websockets
from urllib.parse import urlencode

API_KEY = "your_api_key"

class ChannelTranscriber:
    def __init__(self, channel_id: int, speaker_name: str):
        self.channel_id = channel_id
        self.speaker_name = speaker_name
        self.connection_params = {
            "sample_rate": 16000,
            "speech_model": "u3-rt-pro",
            "format_turns": True,
        }

    async def transcribe_channel(self, audio_stream):
        """Transcribe a single audio channel"""
        url = f"wss://streaming.assemblyai.com/v3/ws?{urlencode(self.connection_params)}"

        # For `websockets` >= 13.0 use `additional_headers`;
        # for older versions (< 13.0), use `extra_headers` instead.
        async with websockets.connect(url, additional_headers={"Authorization": API_KEY}) as ws:
            # Send audio from this channel only
            async for audio_chunk in audio_stream:
                await ws.send(audio_chunk)

            # Receive transcripts
            async for message in ws:
                data = json.loads(message)
                if data.get("type") == "Turn" and data.get("turn_is_formatted"):
                    print(f"{self.speaker_name}: {data['transcript']}")

# Create a transcriber for each channel
async def transcribe_multichannel_meeting(channel_audio_streams):
    transcribers = [
        ChannelTranscriber(0, "Alice"),
        ChannelTranscriber(1, "Bob"),
    ]

    # Run all channels concurrently
    await asyncio.gather(*[
        t.transcribe_channel(stream)
        for t, stream in zip(transcribers, channel_audio_streams)
    ])
```

See our multichannel streaming guide for complete implementation details.

How Should I Handle Pre-recorded Transcription in Production?

Choose the right approach based on your application’s needs:

Option 1: Simple Blocking Call

```python
# Simple blocking call
transcript = await asyncio.to_thread(transcriber.transcribe, audio_url, config=config)
```

Pros:

  • Simple, straightforward code
  • Good for low volume applications
  • Easy to understand and debug

Cons:

  • Ties up resources while waiting
  • Not suitable for high volume
  • Cannot process multiple files simultaneously

Best for: Personal projects, prototypes, low-traffic applications

Option 2: Webhooks (Recommended for Production)

```python
config = aai.TranscriptionConfig(
    webhook_url="https://your-app.com/webhooks/assemblyai",
    webhook_auth_header_name="X-Webhook-Secret",
    webhook_auth_header_value="your_secret_here",
    speaker_labels=True,
    summarization=True,
    # ... other config
)

# Submit job and return immediately (non-blocking)
transcript = transcriber.submit(audio_url, config=config)
print(f"Job submitted: {transcript.id}")
# Your app can continue processing other requests

# Your webhook receives results when ready (typically 15-30% of audio duration)
```

Webhook handler example:

```python
from flask import Flask, request, jsonify
import requests as http_requests

app = Flask(__name__)

@app.route("/webhooks/assemblyai", methods=["POST"])
def assemblyai_webhook():
    # Verify webhook authenticity
    if request.headers.get("X-Webhook-Secret") != "your_secret_here":
        return jsonify({"error": "Unauthorized"}), 401

    data = request.json
    transcript_id = data["transcript_id"]
    status = data["status"]

    if status == "completed":
        # Fetch the full transcript (webhook only sends transcript_id and status)
        transcript = http_requests.get(
            f"https://api.assemblyai.com/v2/transcript/{transcript_id}",
            headers={"authorization": "your_api_key"}
        ).json()
        process_completed_meeting(transcript)
    elif status == "error":
        log_transcription_error(transcript_id)

    return jsonify({"received": True}), 200

def process_completed_meeting(transcript):
    """Process completed meeting transcript"""
    utterances = transcript["utterances"]
    summary = transcript["summary"]

    # Store in database
    save_to_database(transcript)

    # Notify user
    send_notification(transcript["id"])
```

Pros:

  • Non-blocking - submit and forget
  • Scales to high volume
  • Process multiple files in parallel
  • Automatic retry on failures
  • Get notified when complete

Best for: Production apps, user-uploaded recordings, batch processing, SaaS products

Option 3: Polling (Custom Workflows)

```python
# Submit job
transcript = transcriber.submit(audio_url, config=config)
print(f"Submitted: {transcript.id}")

# Poll for completion with progress tracking
while transcript.status not in [aai.TranscriptStatus.completed, aai.TranscriptStatus.error]:
    await asyncio.sleep(5)
    transcript = transcriber.get_transcript(transcript.id)

    # Optional: Show progress
    print(f"Status: {transcript.status}...")

if transcript.status == aai.TranscriptStatus.completed:
    process_transcript(transcript)
else:
    print(f"Error: {transcript.error}")
```

Pros:

  • Full control over retry logic
  • Can show progress to users
  • Good for background jobs
  • Works without webhook infrastructure

Cons:

  • Must implement your own polling loop
  • Ties up resources while polling
  • More complex than webhooks

Best for: Background job processors, CLIs with progress bars, custom retry logic

Comparison Table

| Method | Blocking | Scalability | Complexity | Best For |
| --- | --- | --- | --- | --- |
| Blocking | Yes | Low | Low | Prototypes, low volume |
| Webhooks | No | High | Medium | Production, high volume |
| Polling | Partial | Medium | Medium | Background jobs, progress UI |

Scaling Considerations

  • Rate limits: 20,000 POST requests per 5-minute window
  • Concurrent transcriptions: 200+ for paid accounts (queued beyond that)
  • Ramp up gradually - Start at 10-50 concurrent, double incrementally
  • Use exponential backoff with jitter for 429 errors
  • Contact Sales before large-scale rollouts
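The backoff-with-jitter advice can be sketched as follows. `RateLimitError` is a placeholder for however your HTTP layer surfaces a 429 response, not an AssemblyAI SDK class; this is the generic "full jitter" retry pattern, not official client code:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder: raised by your HTTP layer on a 429 response."""

def request_with_backoff(send_request, max_retries=5, base_delay=1.0, max_delay=60.0):
    """Call send_request, retrying rate-limited attempts with
    exponential backoff and full jitter."""
    for attempt in range(max_retries):
        try:
            return send_request()
        except RateLimitError:
            # Sleep a random amount up to the exponentially growing cap
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
    return send_request()  # final attempt; a 429 here propagates to the caller
```

Randomizing the delay (rather than sleeping exactly `base_delay * 2 ** attempt`) spreads retries from many workers so they don't hit the rate limit in lockstep.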

How Do I Identify Speakers in My Recording?

Speaker diarization tells you when speakers change (“Speaker A”, “Speaker B”), but Speaker Identification tells you who they are by name or role.

Why Use Speaker Identification?

Instead of:

Speaker A: Let's review the Q3 numbers.
Speaker B: Revenue was up 15% this quarter.
Speaker A: Excellent work on that launch.

You get:

Sarah Chen: Let's review the Q3 numbers.
Michael Rodriguez: Revenue was up 15% this quarter.
Sarah Chen: Excellent work on that launch.

How It Works

Speaker Identification uses AssemblyAI’s Speech Understanding API to map generic speaker labels to actual names or roles that you provide:

```python
import assemblyai as aai

aai.settings.api_key = "your_api_key"

# Step 1: Transcribe with speaker diarization
config = aai.TranscriptionConfig(
    speaker_labels=True,  # Must enable speaker diarization first
    speech_understanding={
        "request": {
            "speaker_identification": {
                "speaker_type": "name",  # or "role"
                "known_values": ["Sarah Chen", "Michael Rodriguez", "Alex Kim"]
            }
        }
    }
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("meeting_recording.mp3", config=config)

# Access results with identified speakers
for utterance in transcript.utterances:
    print(f"{utterance.speaker}: {utterance.text}")
```

Identifying by Role Instead of Name

For customer service, sales calls, or scenarios where you don’t know names:

```python
config = aai.TranscriptionConfig(
    speaker_labels=True,
    speech_understanding={
        "request": {
            "speaker_identification": {
                "speaker_type": "role",
                "known_values": ["Agent", "Customer"]  # or ["Interviewer", "Interviewee"]
            }
        }
    }
)
```

Common role combinations:

  • ["Agent", "Customer"] - Customer service calls
  • ["Support", "Customer"] - Technical support
  • ["Interviewer", "Interviewee"] - Interviews
  • ["Host", "Guest"] - Podcasts
  • ["Doctor", "Patient"] - Medical consultations (with HIPAA compliance)

How to Get Speaker Names

For platform recordings:

  1. Zoom: Extract participant names from Zoom API or meeting JSON
  2. Teams: Get attendees from Microsoft Graph API
  3. Google Meet: Use Google Calendar API to get participants

Example with Zoom:

```python
# Get participant names from Zoom meeting
zoom_participants = get_zoom_meeting_participants(meeting_id)
speaker_names = [p["name"] for p in zoom_participants]

# Use in speaker identification
config = aai.TranscriptionConfig(
    speaker_labels=True,
    speakers_expected=len(speaker_names),  # Hint: exact number of speakers
    speech_understanding={
        "request": {
            "speaker_identification": {
                "speaker_type": "name",
                "known_values": speaker_names
            }
        }
    }
)
```

Requirements and Accuracy

Speaker Identification requirements:

  1. Speaker diarization must be enabled - Cannot identify speakers without diarization first
  2. Requires sufficient audio per speaker - Each speaker needs enough speech for accurate matching
  3. Works best with distinct voices - Similar voices may be confused
  4. Post-processing step - Adds additional processing time after transcription

Accuracy depends on:

  • Audio quality (clear, minimal background noise)
  • Voice distinctiveness (different genders, accents, tones)
  • Amount of speech per speaker (more = better)
  • Number of speakers (fewer = more accurate)

Alternative: Add Identification Later

You can add speaker identification to an existing transcript by posting to the Speech Understanding API with the transcript_id.

```python
import requests

# First, transcribe with speaker diarization
transcript = transcriber.transcribe(audio_url, config=aai.TranscriptionConfig(speaker_labels=True))

# Later, add speaker identification using the transcript ID
understanding_body = {
    "transcript_id": transcript.id,
    "speech_understanding": {
        "request": {
            "speaker_identification": {
                "speaker_type": "name",
                "known_values": ["Sarah Chen", "Michael Rodriguez"]
            }
        }
    }
}

result = requests.post(
    "https://llm-gateway.assemblyai.com/v1/understanding",
    headers={"Authorization": aai.settings.api_key},
    json=understanding_body
).json()

# Access identified speakers from the response
for utterance in result["utterances"]:
    print(f"{utterance['speaker']}: {utterance['text']}")
```

This approach is useful when:

  • You get speaker names after the transcription completes
  • You want to try different name mappings
  • Building iterative workflows where users confirm speaker identities
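When users confirm or correct identities in such an iterative workflow, re-applying a label-to-name mapping to stored utterances is a simple substitution. A minimal sketch, where the dict shape mirrors the API's `speaker`/`text` utterance fields:

```python
def apply_speaker_names(utterances, name_map):
    """Replace generic diarization labels (e.g. 'A') with confirmed names."""
    return [
        {**u, "speaker": name_map.get(u["speaker"], u["speaker"])}
        for u in utterances
    ]
```

Labels without a confirmed name are left unchanged, so partial confirmations still render sensibly.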

For complete API details, see our Speaker Identification documentation.

How Do I Translate Between Languages in Meetings?

AssemblyAI supports translation between 99+ languages, enabling you to transcribe meetings in one language and translate to another.

When to Use Translation

Common use cases:

  • Transcribe Spanish meeting → Translate to English for documentation
  • Transcribe multilingual meeting → Translate all to common language
  • Create translated meeting notes for international teams
  • Provide translated summaries for stakeholders

Basic Translation

Translation is a Speech Understanding feature. You enable it via the speech_understanding parameter with target_languages:

```python
import requests
import time

base_url = "https://api.assemblyai.com"
headers = {"authorization": "YOUR_API_KEY"}

# Configure transcription with translation
data = {
    "audio_url": "https://assembly.ai/wildfires.mp3",
    "speech_models": ["universal-3-pro", "universal-2"],
    "language_detection": True,
    "speaker_labels": True,
    "speech_understanding": {
        "request": {
            "translation": {
                "target_languages": ["es", "de"],
                "formal": True
            }
        }
    }
}

response = requests.post(base_url + "/v2/transcript", headers=headers, json=data)
transcript_id = response.json()["id"]
polling_endpoint = base_url + f"/v2/transcript/{transcript_id}"

while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()
    if transcript["status"] == "completed":
        break
    elif transcript["status"] == "error":
        raise RuntimeError(f"Transcription failed: {transcript['error']}")
    else:
        time.sleep(3)

print("--- Original Transcript ---")
print(transcript["text"][:200] + "...")

print("\n--- Translations ---")
for language_code, translated_text in transcript["translated_texts"].items():
    print(f"{language_code.upper()}:")
    print(translated_text[:200] + "...")
```

Translation with Speaker Labels

For meetings where you need per-utterance translations with speaker attribution:

```python
data = {
    "audio_url": audio_url,
    "speech_models": ["universal-3-pro", "universal-2"],
    "speaker_labels": True,
    "speech_understanding": {
        "request": {
            "translation": {
                "target_languages": ["es"],
                "match_original_utterance": True,
                "formal": True
            }
        }
    }
}

# Submit and poll for completion as in the previous example, then:
for utterance in transcript["utterances"]:
    print(f"Speaker {utterance['speaker']}:")
    print(f"  Original: {utterance['text'][:100]}...")
    print(f"  Spanish: {utterance['translated_texts']['es'][:100]}...")
```

Supported Language Pairs

AssemblyAI supports translation between 99+ languages, including:

Popular combinations:

  • Spanish ↔ English
  • French ↔ English
  • German ↔ English
  • Mandarin ↔ English
  • Japanese ↔ English
  • Portuguese ↔ English
  • And all combinations between supported languages

Translation Response Format

The response includes translated_texts as a dictionary keyed by language code:

{
  "text": "Original transcript in source language",
  "translated_texts": {
    "es": "Translated transcript in Spanish",
    "de": "Translated transcript in German"
  },
  "utterances": [
    {
      "speaker": "A",
      "text": "Hello, how are you?",
      "translated_texts": {
        "es": "Hola, ¿cómo estás?"
      },
      "start": 0,
      "end": 1500
    }
  ]
}

For complete language support and translation details, see our Translation documentation.

What Workflows Can I Build for My AI Meeting Notetaker?

Use these Speech Understanding and Guardrails features to transform raw transcripts into actionable insights.

Summarization

summarization: true

What it does: Generates an abstractive recap of the conversation (not verbatim).
Output: summary string (bullets/paragraph format).
Great for: Meeting notes, call recaps, executive summaries.
Notes: Condenses and rephrases; minor details may be omitted by design.

Example:

config = aai.TranscriptionConfig(
    summarization=True,
    summary_type="bullets",  # or "bullets_verbose", "gist", "headline", "paragraph"
    summary_model="informative",  # or "conversational"
)

Sentiment Analysis

sentiment_analysis: true

What it does: Scores per-utterance sentiment (positive / neutral / negative).
Output: Array of { text, sentiment, confidence, start, end }.
Great for: Customer satisfaction tracking, coaching, churn prediction.
Notes: Segment-level (not global mood); sarcasm and very short utterances are harder to classify.

Example:

for utterance in transcript.sentiment_analysis_results:
    if utterance.sentiment == "NEGATIVE":
        print(f"Negative sentiment detected: {utterance.text}")

Entity Detection

entity_detection: true

What it does: Extracts named entities (people, organizations, locations, products, etc.).
Output: Array of { entity_type, text, start, end }.
Great for: Auto-tagging topics, tracking competitors mentioned, CRM enrichment.
Notes: Operates on post-redaction text if PII redaction is enabled.

Example:

# Extract all organizations mentioned
organizations = [
    entity.text for entity in transcript.entities
    if entity.entity_type == "organization"
]
print(f"Companies mentioned: {', '.join(organizations)}")

Redact PII Text

redact_pii: true

What it does: Scans transcript for personally identifiable information and replaces matches per policy.
Output: text with replacements; the original word timings are preserved.
Great for: GDPR/CCPA compliance, safe sharing, SOC2 requirements.
Notes: Runs before downstream features; they see the redacted text.

Recommended policies for meetings:

config = aai.TranscriptionConfig(
    redact_pii=True,
    redact_pii_policies=[
        aai.PIIRedactionPolicy.person_name,    # Remove names
        aai.PIIRedactionPolicy.email_address,  # Remove emails
        aai.PIIRedactionPolicy.phone_number,   # Remove phone numbers
        aai.PIIRedactionPolicy.organization,   # Remove company names
    ],
    redact_pii_sub=aai.PIISubstitutionPolicy.hash,  # Stable hash tokens
)

Why hash substitution?

  • Stable across the file (same value → same token)
  • Maintains sentence structure for LLM processing
  • Prevents reconstruction of original data

Redact PII Audio

redact_pii_audio: true

What it does: Produces a second audio file where redacted portions are bleeped/silenced.
Output: redacted_audio_url in the transcript response.
Great for: External sharing, training materials, demos.
Notes: Original audio is untouched; bleeped sections may sound choppy.
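Since the transcript response exposes redacted_audio_url, saving the bleeped file is a short download step. A minimal sketch, assuming the requests library and a completed transcript dict from a polling loop like the one shown earlier:

```python
import requests


def download_redacted_audio(transcript: dict, output_path: str = "redacted_meeting.mp3") -> str:
    """Download the PII-bleeped audio file referenced by a completed transcript.

    Expects the transcript JSON to contain `redacted_audio_url`, which is
    present when redact_pii_audio was enabled on the request.
    """
    url = transcript.get("redacted_audio_url")
    if not url:
        raise ValueError("No redacted_audio_url present - was redact_pii_audio enabled?")

    resp = requests.get(url, timeout=60)
    resp.raise_for_status()

    # Write the audio bytes to disk for external sharing
    with open(output_path, "wb") as f:
        f.write(resp.content)
    return output_path
```

The redacted URL is temporary, so download the file promptly rather than storing the URL itself.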

Complete Example

config = aai.TranscriptionConfig(
    # Core transcription
    speaker_labels=True,

    # Speech Understanding
    summarization=True,
    sentiment_analysis=True,
    entity_detection=True,

    # PII protection
    redact_pii=True,
    redact_pii_policies=[
        aai.PIIRedactionPolicy.person_name,
        aai.PIIRedactionPolicy.email_address,
        aai.PIIRedactionPolicy.phone_number,
    ],
    redact_pii_sub=aai.PIISubstitutionPolicy.hash,
    redact_pii_audio=True,
)

transcript = transcriber.transcribe(audio_url, config=config)

# Access all features
meeting_insights = {
    "summary": transcript.summary,
    "sentiment_trend": analyze_sentiment_trend(transcript.sentiment_analysis_results),
    "entities": extract_entities(transcript.entities),
    "safe_transcript": transcript.text,  # PII redacted
    "safe_audio": transcript.redacted_audio_url,  # PII bleeped
}

How Do I Improve the Accuracy of My Notetaker?

Best practices for keyterms prompts:

  • Include participant names for better speaker recognition
  • Add company-specific jargon and acronyms
  • Include product names and technical terms
  • Keep individual terms under 50 characters
  • Up to 200 terms per request (Universal-2) or 1000 terms (Universal-3-Pro)

Using Keyterms Prompt for Pre-recorded Transcription

Keyterms prompting improves recognition accuracy for domain-specific vocabulary by up to 21%:

# Define domain-specific vocabulary
company_terms = [
    "AssemblyAI",
    "Universal-3 Pro",
    "Speech Understanding",
    "diarization"
]

participant_names = [
    "Dylan Fox",
    "Sarah Chen",
    "Michael Rodriguez"
]

technical_terms = [
    "API endpoint",
    "WebSocket",
    "latency metrics",
    "TTFT"
]

# Configure with keyterms prompt
config = aai.TranscriptionConfig(
    keyterms_prompt=company_terms + participant_names + technical_terms,
    speaker_labels=True,
    # ... other settings
)
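The limits listed under the best practices above (terms under 50 characters, up to 200 terms for Universal-2 or 1000 for Universal-3 Pro) can be enforced before sending the request. A small sketch, with a hypothetical prepare_keyterms helper where you pass the model's term limit yourself:

```python
def prepare_keyterms(terms, max_terms=200, max_length=50):
    """Deduplicate keyterms, drop overlong entries, and cap the list size.

    max_terms: 200 for Universal-2, 1000 for Universal-3 Pro (per this guide).
    max_length: terms must stay under 50 characters.
    """
    seen = set()
    cleaned = []
    for term in terms:
        term = term.strip()
        if not term or len(term) >= max_length:
            continue  # skip empty or overlong terms
        key = term.lower()
        if key in seen:
            continue  # skip case-insensitive duplicates
        seen.add(key)
        cleaned.append(term)
    return cleaned[:max_terms]
```

Running keyterm lists through a validator like this keeps user-supplied vocabulary (participant names, CRM data) from silently exceeding the request limits.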

Using Keyterms Prompt for Streaming

# Streaming with contextual keyterms
keyterms = [
    # Participant names
    "Alice Johnson",
    "Bob Smith",

    # Meeting-specific vocabulary
    "Q4 objectives",
    "revenue targets",
    "customer acquisition",

    # Technical terms
    "API integration",
    "cloud migration"
]

CONNECTION_PARAMS = {
    "sample_rate": 16000,
    "speech_model": "u3-rt-pro",
    "format_turns": True,
    "keyterms_prompt": keyterms,
}
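Streaming connection parameters like these are typically serialized into the WebSocket URL's query string. The sketch below is an assumption-laden illustration: the wss://streaming.assemblyai.com/v3/ws endpoint, JSON-encoding for list values, and lowercased booleans should all be verified against the Streaming API reference before use:

```python
import json
from urllib.parse import urlencode

# Assumed endpoint; confirm against the Streaming API reference.
BASE_ENDPOINT = "wss://streaming.assemblyai.com/v3/ws"


def build_streaming_url(params: dict) -> str:
    """Serialize connection params into a WebSocket URL query string.

    List values (such as keyterms_prompt) are JSON-encoded and booleans
    are lowercased; treat both conventions as assumptions to verify.
    """
    encoded = {}
    for key, value in params.items():
        if isinstance(value, list):
            encoded[key] = json.dumps(value)
        elif isinstance(value, bool):
            encoded[key] = str(value).lower()
        else:
            encoded[key] = value
    return f"{BASE_ENDPOINT}?{urlencode(encoded)}"
```

The resulting URL is what you would hand to your WebSocket client along with the authorization header.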

How Do I Process the Response from the API?

Processing Pre-recorded Responses

def process_transcript(transcript):
    """
    Extract and process all relevant data from a pre-recorded transcript.
    """
    # Basic transcript data
    meeting_data = {
        "id": transcript.id,
        "duration": transcript.audio_duration,
        "confidence": transcript.confidence,
        "full_text": transcript.text
    }

    # Process speaker utterances
    speakers = {}
    for utterance in transcript.utterances:
        speaker = utterance.speaker

        if speaker not in speakers:
            speakers[speaker] = {
                "utterances": [],
                "total_speaking_time": 0,
                "word_count": 0
            }

        speakers[speaker]["utterances"].append({
            "text": utterance.text,
            "start": utterance.start,
            "end": utterance.end,
            "confidence": utterance.confidence
        })

        # Calculate speaking time (utterance timestamps are in milliseconds)
        speakers[speaker]["total_speaking_time"] += (utterance.end - utterance.start) / 1000
        speakers[speaker]["word_count"] += len(utterance.text.split())

    meeting_data["speakers"] = speakers

    # Extract summary
    if transcript.summary:
        meeting_data["summary"] = transcript.summary

    # Calculate meeting statistics
    total_duration = transcript.audio_duration  # Already in seconds
    meeting_data["statistics"] = {
        "total_speakers": len(speakers),
        "total_words": sum(s["word_count"] for s in speakers.values()),
        "average_confidence": transcript.confidence,
        "speaking_distribution": {
            speaker: {
                "percentage": (data["total_speaking_time"] / total_duration) * 100,
                "minutes": data["total_speaking_time"] / 60
            }
            for speaker, data in speakers.items()
        }
    }

    return meeting_data

# Example usage
result = process_transcript(transcript)
print(f"Meeting had {result['statistics']['total_speakers']} speakers")
print(f"Speaker A spoke for {result['statistics']['speaking_distribution']['A']['minutes']:.1f} minutes")
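The dict returned by process_transcript can then be rendered into shareable notes. A minimal sketch, with a hypothetical format_meeting_notes helper (the layout is illustrative, not part of the API):

```python
def format_meeting_notes(meeting_data: dict) -> str:
    """Render process_transcript() output as a plain-text meeting recap."""
    lines = [f"Meeting {meeting_data['id']} - {meeting_data['duration']}s"]

    # Include the abstractive summary when summarization was enabled
    if "summary" in meeting_data:
        lines += ["", "Summary:", meeting_data["summary"]]

    # Per-speaker talk-time breakdown from the computed statistics
    lines += ["", "Speaking time:"]
    dist = meeting_data["statistics"]["speaking_distribution"]
    for speaker, share in sorted(dist.items()):
        lines.append(
            f"  Speaker {speaker}: {share['minutes']:.1f} min ({share['percentage']:.0f}%)"
        )
    return "\n".join(lines)
```

Because the input is a plain dict, the same data can just as easily be posted to Slack, written to a CRM, or fed to an LLM prompt.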

Processing Streaming Responses

import json
from datetime import datetime

import websockets

class StreamingResponseProcessor:
    def __init__(self):
        self.partial_buffer = ""
        self.final_transcripts = []
        self.turn_metadata = []

    def process_message(self, message: dict):
        """
        Process real-time streaming messages.
        """
        msg_type = message.get("type")

        if msg_type == "Begin":
            return {
                "event": "session_started",
                "session_id": message.get("id"),
                "expires_at": message.get("expires_at")
            }

        elif msg_type == "Turn":
            return self.process_turn(message)

        elif msg_type == "Termination":
            return {
                "event": "session_ended",
                "audio_duration": message.get("audio_duration_seconds"),
                "session_duration": message.get("session_duration_seconds")
            }

        # Unrecognized message types fall through with a sentinel event
        return {"event": "unknown"}

    def process_turn(self, data: dict):
        """Process turn messages"""
        is_final = data.get("end_of_turn")
        is_formatted = data.get("turn_is_formatted")
        transcript = data.get("transcript", "")
        turn_order = data.get("turn_order")

        response = {
            "event": "turn",
            "turn_order": turn_order,
            "is_final": is_final,
            "is_formatted": is_formatted,
            "confidence": data.get("end_of_turn_confidence", 0)
        }

        # Handle partials (for live display)
        if not is_final and transcript:
            self.partial_buffer = transcript
            response["event"] = "partial"
            response["text"] = transcript

        # Handle finals (for storage)
        elif is_final and is_formatted:
            final_transcript = {
                "turn_order": turn_order,
                "text": transcript,
                "confidence": data.get("end_of_turn_confidence"),
                "timestamp": datetime.now().isoformat()
            }
            self.final_transcripts.append(final_transcript)
            response["event"] = "final"
            response["text"] = transcript

            # Clear partial buffer
            self.partial_buffer = ""

        return response

    def get_full_transcript(self):
        """
        Combine all final transcripts into the complete meeting transcript.
        """
        return {
            "full_text": " ".join(t["text"] for t in self.final_transcripts),
            "transcripts": self.final_transcripts,
            "total_turns": len(self.final_transcripts)
        }

# Example usage
processor = StreamingResponseProcessor()

# If you're using `websockets` version 13.0 or later, use the `additional_headers`
# parameter. For older versions (< 13.0), use `extra_headers` instead.
async with websockets.connect(API_ENDPOINT, additional_headers=headers) as ws:
    async for message in ws:
        data = json.loads(message)
        result = processor.process_message(data)

        if result.get("event") == "partial":
            # Update UI with live transcript
            update_live_caption(result["text"])

        elif result.get("event") == "final":
            # Save final transcript
            save_transcript_segment(result)

# Get complete transcript when done
full_transcript = processor.get_full_transcript()

Additional Resources