Streaming Diarization
Streaming Diarization lets you identify and label individual speakers in real time directly from the Streaming API. Each Turn event includes a speaker_label
field (e.g. A, B) indicating the dominant speaker for that turn. Each final
word in the words array also carries a speaker field, enabling mid-turn speaker
change detection. Speaker accuracy improves over the course of a session as the model
accumulates embedding context — so the longer the conversation, the better the labels.
Quickstart
Get started with Streaming Diarization using the quickstart example, which streams audio from your microphone and prints each turn with its speaker label. It is available for Python, the Python SDK, JavaScript, and the JavaScript SDK.
Configuration
Enable Streaming Diarization by adding speaker_labels: true to your connection
parameters. You can optionally cap the number of speakers with max_speakers.
| Parameter | Type | Default | Description |
|---|---|---|---|
| speaker_labels | boolean | false | Set to true to enable real-time speaker diarization. |
| max_speakers | integer | none | Optional. Hint the maximum number of speakers expected (1–10). Setting this accurately can improve assignment accuracy when you know the speaker count in advance. |
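As a rough illustration of the two parameters in the table, the sketch below assembles them into a connection URL. The endpoint value, the sample_rate parameter, and the helper name build_streaming_url are assumptions made for this example, not values taken from this page; check the Streaming API reference before relying on them.

```python
from urllib.parse import urlencode

# Assumed endpoint, for illustration only; confirm it in the Streaming API reference.
STREAMING_ENDPOINT = "wss://streaming.assemblyai.com/v3/ws"


def build_streaming_url(speaker_labels=True, max_speakers=None, sample_rate=16000):
    """Build a connection URL with diarization parameters (hypothetical helper)."""
    params = {
        "sample_rate": sample_rate,  # assumed parameter, not described on this page
        "speaker_labels": "true" if speaker_labels else "false",
    }
    if max_speakers is not None:
        # The table above documents the accepted range as 1-10.
        if not 1 <= max_speakers <= 10:
            raise ValueError("max_speakers must be between 1 and 10")
        params["max_speakers"] = max_speakers
    return f"{STREAMING_ENDPOINT}?{urlencode(params)}"


print(build_streaming_url(max_speakers=2))
```

Leaving max_speakers unset omits it from the query string, which matches its optional status in the table.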
Diarization is supported on all streaming models:
u3-rt-pro,
universal-streaming-english, and universal-streaming-multilingual. You do
not need to change your speech model to use it; just add
speaker_labels: true.
Reading speaker labels
When diarization is enabled, every Turn event includes a speaker_label field
reflecting the dominant speaker for that turn.
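A minimal sketch of reading the turn-level label from an incoming message might look like the following. It assumes each message is a JSON object with a type field set to "Turn" and a transcript field holding the text; those two names and the helper format_turn are assumptions for illustration, while speaker_label comes from the description above.

```python
import json


def format_turn(raw_message):
    """Return '[speaker] text' for a Turn message, or None for other messages."""
    message = json.loads(raw_message)
    if message.get("type") != "Turn":  # assumed message discriminator
        return None
    # Display choice for this sketch: show a missing or null label as UNKNOWN.
    speaker = message.get("speaker_label") or "UNKNOWN"
    text = message.get("transcript", "")  # assumed field name for the turn text
    return f"[{speaker}] {text}"


sample = json.dumps({"type": "Turn", "speaker_label": "A", "transcript": "Hello there."})
print(format_turn(sample))
```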
Word-level speaker labels
Each final word in the words array also carries a speaker field. This allows
you to detect speaker changes within a single turn — for example, a turn where one
speaker finishes another’s sentence, or where a brief interjection appears mid-turn.
Notes on the speaker field:
- Final words only. The speaker field only appears on words where word_is_final: true. Non-final (in-progress) words never carry it.
- speaker can be absent on individual words. If the field is missing from a word entirely, treat that word as unattributed and fall back to the turn-level speaker_label if you need a label. Absent means the field is omitted from the JSON; it will never be null.
- UNKNOWN at word level means the model couldn’t confidently attribute that word to any specific speaker. This is common for short backchannels (“uh huh”, “yeah”) or brief low-quality audio segments. It is not an ambiguity flag between two known speakers; words in a confidently attributed stretch carry the speaker’s letter, not UNKNOWN.
If a turn contains less than ~1 second of audio, its speaker_label
will be set to "UNKNOWN". This is because the model needs at least ~1 second of
audio to generate a reliable diarization embedding — without enough audio, embeddings
may be inaccurate and could lead to a single speaker being labeled as multiple
speakers. Labeling short turns as "UNKNOWN" ensures that speaker labels remain
as accurate as possible.
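The word-level rules described in this section can be sketched as a small grouping function: it skips non-final words, falls back to the turn-level speaker_label when a word has no speaker field, and splits the turn wherever the speaker changes. The text field on each word and the helper name split_turn_by_speaker are assumptions for illustration.

```python
def split_turn_by_speaker(turn):
    """Group a turn's final words into consecutive same-speaker segments."""
    fallback = turn.get("speaker_label")
    segments = []  # list of (speaker, [word texts]) in spoken order
    for word in turn.get("words", []):
        if not word.get("word_is_final"):
            continue  # non-final words never carry a speaker field
        # An absent speaker field means unattributed: use the turn-level label.
        speaker = word.get("speaker", fallback)
        if segments and segments[-1][0] == speaker:
            segments[-1][1].append(word["text"])  # "text" is an assumed field name
        else:
            segments.append((speaker, [word["text"]]))
    return [(speaker, " ".join(texts)) for speaker, texts in segments]


turn = {
    "speaker_label": "A",
    "words": [
        {"text": "so", "word_is_final": True, "speaker": "A"},
        {"text": "we", "word_is_final": True},
        {"text": "yeah", "word_is_final": True, "speaker": "B"},
        {"text": "shou", "word_is_final": False},
    ],
}
print(split_turn_by_speaker(turn))  # [('A', 'so we'), ('B', 'yeah')]
```

Words labeled UNKNOWN are kept as their own segments here; whether to merge them into a neighboring speaker is an application decision.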
How speaker accuracy improves over time
Streaming Diarization builds a speaker profile incrementally as audio flows in. In practice this means:
- Early in a session, speaker assignments may be less stable, especially if the first few turns are short.
- As the session progresses, the model accumulates richer speaker embeddings and assignments become more consistent.
Revised speaker labels
During a live session, Streaming Diarization assigns speaker_label values in
real time as each Turn is emitted. These labels can shift as the session
progresses and more audio becomes available. Early turns in particular may
be reassigned as the model builds a clearer picture of each speaker.
When the session ends, the server performs a final refinement pass with full
visibility into the entire conversation. Any turns whose speaker labels changed
are sent back as a single SpeakerRevision message. Turns that were already
correct are omitted.
Use the revised labels whenever you need the highest-quality speaker attribution
for the final transcript — for example, when persisting a meeting transcript,
generating a post-call summary, or feeding text into a downstream LLM that
benefits from accurate speaker turns.
The end-of-session refinement adds approximately 400 ms of latency. Any
SpeakerRevision messages arrive before the Termination message and do
not affect the real-time speaker_label values delivered during the session.
Message shape
A single SpeakerRevision message is sent at the end of the session containing
a revisions array. Each item corrects one turn; only turns whose speaker
assignments changed are included. The turn_order field in each item matches
the turn_order of the original Turn message being revised.
| Field | Type | Description |
|---|---|---|
| type | "SpeakerRevision" | Message type identifier. |
| revisions | Revision[] | Array of turn corrections. Only turns whose labels changed are included. |
| revisions[].turn_order | integer | Matches the turn_order of the original Turn being corrected. |
| revisions[].speaker_label | string \| null | Corrected turn-level speaker label. |
| revisions[].words | Word[] | Words with corrected per-word speaker assignments. |
Text content and word timestamps are never changed. Only speaker
assignments are revised.
How to handle it
Match each turn_order in revisions against the turn you already received,
then replace its speaker_label and per-word speaker values.
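One way to sketch this matching step is shown below. It assumes you stored each received Turn in a dictionary keyed by turn_order, and that the revised words array lines up index-for-index with the original words (plausible given that text and timestamps are never changed, but still an assumption). The helper name apply_speaker_revision is hypothetical.

```python
def apply_speaker_revision(turns_by_order, message):
    """Apply a SpeakerRevision message in place; return how many turns changed."""
    if message.get("type") != "SpeakerRevision":
        return 0
    applied = 0
    for revision in message.get("revisions", []):
        turn = turns_by_order.get(revision["turn_order"])
        if turn is None:
            continue  # revision for a turn we never stored; ignore it
        turn["speaker_label"] = revision.get("speaker_label")
        # Assumption: revised words align positionally with the stored words.
        for original, revised in zip(turn.get("words", []), revision.get("words", [])):
            if "speaker" in revised:
                original["speaker"] = revised["speaker"]
            else:
                original.pop("speaker", None)
        applied += 1
    return applied


turns = {
    0: {"turn_order": 0, "speaker_label": "A", "words": [{"text": "hi", "speaker": "A"}]},
    1: {"turn_order": 1, "speaker_label": "B", "words": [{"text": "hey", "speaker": "B"}]},
}
revision_message = {
    "type": "SpeakerRevision",
    "revisions": [
        {"turn_order": 0, "speaker_label": "B", "words": [{"text": "hi", "speaker": "B"}]}
    ],
}
print(apply_speaker_revision(turns, revision_message))  # 1
```

Because the stored turns are updated in place, a session that never delivers a SpeakerRevision simply keeps its live labels.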
When it is sent
Revisions are only sent at the end of a stream, after the Terminate signal
and before the Termination message. They are never emitted mid-session. A
given session may produce zero or many revisions; only turns whose speaker
assignments changed are included.
If the session ends unexpectedly (network drop, error closure), revisions may
not be delivered. Always handle this gracefully and fall back to the live
labels you received during the session.
Known limitations
Real-time diarization is an inherently harder problem than diarization for async transcription on pre-recorded audio. The following limitations apply to the current beta:
- Short utterances: Turns with less than ~1 second of audio are labeled as "UNKNOWN" because there is insufficient audio to generate a reliable speaker embedding. This prevents inaccurate embeddings from causing a single speaker to be split across multiple labels.
- Overlapping speech: When two speakers talk simultaneously, the model cannot split the audio and will assign the turn to a single speaker. Performance degrades with frequent cross-talk.
- Session start accuracy: The first 1–2 turns of a session may be misassigned because the model has not yet built up speaker profiles. This self-corrects quickly in practice.
- Noisy environments: Background noise and microphone bleed between speakers can reduce embedding quality and lead to more frequent misassignments.
Supported models
| Model | speech_model value | Diarization supported |
|---|---|---|
| Universal-3 Pro Streaming | u3-rt-pro | ✓ |
| Universal Streaming (English) | universal-streaming-english | ✓ |
| Universal Streaming (Multilingual) | universal-streaming-multilingual | ✓ |
Multichannel streaming audio
To transcribe multichannel streaming audio, we recommend creating a separate session for each channel. This approach allows you to maintain clear speaker separation and get accurate diarized transcriptions for conversations, phone calls, or interviews where speakers are recorded on two different channels. The accompanying code example (available for Python, the Python SDK, JavaScript, and the JavaScript SDK) demonstrates how to transcribe a dual-channel audio file with diarized, speaker-separated transcripts. The same approach can be applied to any multichannel audio stream, including those with more than two channels.
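Before each channel can get its own session, the audio has to be separated by channel. The sketch below shows one way to do that for interleaved 16-bit stereo PCM, which is an assumed input format; the helper name split_stereo_pcm16 is hypothetical, and opening the two streaming sessions is left out.

```python
def split_stereo_pcm16(interleaved):
    """Split interleaved 16-bit stereo PCM bytes into (left, right) mono bytes."""
    frame_size = 4  # 2 bytes per sample x 2 channels
    if len(interleaved) % frame_size != 0:
        raise ValueError("buffer must contain whole stereo frames")
    left = bytearray()
    right = bytearray()
    for offset in range(0, len(interleaved), frame_size):
        left += interleaved[offset:offset + 2]       # first sample of the frame
        right += interleaved[offset + 2:offset + 4]  # second sample of the frame
    return bytes(left), bytes(right)


# Two stereo frames: left samples 01 00 and 02 00, right samples ff ff and fe ff.
chunk = bytes([1, 0, 255, 255, 2, 0, 254, 255])
left_channel, right_channel = split_stereo_pcm16(chunk)
print(left_channel.hex(), right_channel.hex())  # 01000200 fffffeff
```

Each returned buffer can then be sent to its own session, so every transcript maps to exactly one channel.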
Configure turn detection for your use case
The examples above use turn detection settings optimized for short responses and rapid back-and-forth conversations. To optimize for your specific audio scenario, you can adjust the turn detection parameters. For configuration examples tailored to different use cases, refer to our Configuration examples.
Modify the turn detection parameters in
API_PARAMS: