Streaming Diarization
Streaming Diarization lets you identify and label individual speakers in real
time directly from the Streaming API. Each Turn event includes a speaker_label
field (e.g. A, B) indicating which speaker produced that
transcript. Speaker accuracy improves over the course of a session as the model
accumulates embedding context — so the longer the conversation, the better the
labels.
Diarization is supported on all streaming models: u3-rt-pro,
universal-streaming-english, and universal-streaming-multilingual.
Already using AssemblyAI streaming?
You can enable Streaming Diarization by adding speaker_labels: true to your
connection parameters. No other changes are required — the speaker_label
field will appear on every Turn event automatically.
Looking for multichannel streaming? See Multichannel streams.
Quickstart
Get started with Streaming Diarization using the code below. This example streams audio from your microphone and prints each turn with its speaker label.
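A minimal sketch of the connection setup and turn handling in Python follows. The endpoint URL, parameter names, and Turn event shape here are assumptions based on this page rather than a verified SDK reference — check the API reference before using them verbatim.

```python
from urllib.parse import urlencode

# Assumed websocket endpoint -- confirm against the current API reference.
BASE_URL = "wss://streaming.assemblyai.com/v3/ws"

def build_connection_url(sample_rate: int = 16000) -> str:
    """Build the streaming connection URL with diarization enabled."""
    params = {
        "sample_rate": sample_rate,
        "speaker_labels": "true",  # enables Streaming Diarization
    }
    return f"{BASE_URL}?{urlencode(params)}"

def handle_turn(event: dict) -> str:
    """Format a Turn event as 'Speaker X: transcript'."""
    # speaker_label may be null early in a session; fall back to '?'
    speaker = event.get("speaker_label") or "?"
    return f"Speaker {speaker}: {event.get('transcript', '')}"

# In a real session you would open a websocket to build_connection_url(),
# stream microphone PCM audio into it, and call handle_turn() on each
# incoming Turn message, e.g. with the `websockets` library:
#
#   async with websockets.connect(build_connection_url()) as ws:
#       ...
```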
Configuration
Enable Streaming Diarization by adding speaker_labels: true to your connection
parameters. You can optionally cap the number of speakers with max_speakers.
Diarization is supported on u3-rt-pro, universal-streaming-english, and
universal-streaming-multilingual. You do not need to change your speech
model to use it — just add speaker_labels: true.
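As a sketch, a diarized session's connection parameters might look like the following. The parameter names come from this page; the sample rate and speaker cap are illustrative values, not defaults.

```python
# Connection parameters for a diarized streaming session.
# Parameter names follow this page; the values shown are illustrative.
connection_params = {
    "sample_rate": 16000,     # PCM sample rate of your audio source
    "speaker_labels": True,   # enable Streaming Diarization
    "max_speakers": 4,        # optional: cap the number of speaker labels
}
```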
Reading speaker labels
When diarization is enabled, every Turn event includes a speaker_label field
with a label such as A, B, and so on.
If the model cannot confidently assign a speaker — typically for very short
utterances at the start of a session — the speaker_label field may be null.
Your application should handle this case gracefully.
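One way to handle the null case is a small accessor that substitutes a placeholder, as in this sketch (the Turn event is treated as a plain dict, which is an assumption about the wire format):

```python
def label_for(turn: dict) -> str:
    """Return the speaker label for a Turn event, tolerating null labels."""
    label = turn.get("speaker_label")
    # Short utterances early in a session may arrive with a null
    # speaker_label; substitute a placeholder rather than crashing
    # or silently attributing the turn to the wrong speaker.
    return label if label is not None else "unknown"
```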
A typical multi-speaker exchange looks like this:
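(The exchange below is illustrative; actual transcripts and label assignments will differ.)

```
Speaker A: Thanks for calling, how can I help you today?
Speaker B: Hi, I'd like to check on my order status.
Speaker A: Sure, can I get your order number?
```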
How speaker accuracy improves over time
Streaming Diarization builds a speaker profile incrementally as audio flows in. In practice this means:
- Early in a session, speaker assignments may be less stable, especially if the first few turns are short.
- As the session progresses, the model accumulates richer speaker embeddings and assignments become more consistent.
For long-form use cases (call centers, clinical scribes, meeting transcription), labels typically become accurate and stable early in the conversation and remain so for its duration.
Known limitations
Real-time diarization is an inherently harder problem than diarization for async transcription on pre-recorded audio. The following limitations apply to the current beta:
- Short utterances — Turns shorter than ~3 words provide insufficient audio for a reliable speaker embedding. Single-word responses like “yes” or “no” may receive a low-confidence or incorrect label.
- Overlapping speech — When two speakers talk simultaneously, the model cannot split the audio and will assign the turn to a single speaker. Performance degrades with frequent cross-talk.
- Session start accuracy — The first 1–2 turns of a session may be misassigned because the model has not yet built up speaker profiles. This self-corrects quickly in practice.
- Noisy environments — Background noise and microphone bleed between speakers can reduce embedding quality and lead to more frequent misassignments.
For the best results, use a microphone setup that minimizes cross-talk and background noise, and ensure each speaker produces at least a few complete sentences before you rely on per-turn labels for downstream processing.
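The guidance above can be sketched as a simple gate applied before downstream processing. The word threshold and turn-index cutoff below are illustrative heuristics drawn from the limitations listed on this page, not parameters of the API:

```python
MIN_WORDS = 3  # turns shorter than ~3 words yield weak speaker embeddings

def is_label_reliable(turn: dict, turn_index: int) -> bool:
    """Heuristic gate: trust a turn's speaker label only when it is
    present, long enough, and not among the first turns of the session."""
    label = turn.get("speaker_label")
    words = turn.get("transcript", "").split()
    return (
        label is not None
        and len(words) >= MIN_WORDS
        and turn_index >= 2  # the first 1-2 turns may be misassigned
    )
```

A pipeline might buffer or discard turns that fail this check instead of routing them by speaker.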