Streaming Diarization
Streaming Diarization lets you identify and label individual speakers in real
time directly from the Streaming API. Each Turn event includes a speaker_label
field (e.g. A, B) indicating which speaker produced that
transcript. Speaker accuracy improves over the course of a session as the model
accumulates embedding context — so the longer the conversation, the better the
labels.
Diarization is supported on all streaming models: u3-rt-pro,
universal-streaming-english, and universal-streaming-multilingual.
Already using AssemblyAI streaming?
You can enable Streaming Diarization by adding speaker_labels: true to your
connection parameters. No other changes are required — the speaker_label
field will appear on every Turn event automatically.
Looking for multichannel streaming? See Multichannel streams.
Quickstart
Get started with Streaming Diarization using the code below. This example streams audio from your microphone and prints each turn with its speaker label.
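A minimal sketch of the connection setup and turn handling in Python follows. The endpoint URL, parameter names, and Turn event shape here are assumptions based on this page rather than a verified SDK reference — check the API reference before using them verbatim.

```python
from urllib.parse import urlencode

# Assumed websocket endpoint -- confirm against the current API reference.
BASE_URL = "wss://streaming.assemblyai.com/v3/ws"

def build_connection_url(sample_rate: int = 16000) -> str:
    """Build the streaming connection URL with diarization enabled."""
    params = {
        "sample_rate": sample_rate,
        "speaker_labels": "true",  # enables Streaming Diarization
    }
    return f"{BASE_URL}?{urlencode(params)}"

def handle_turn(event: dict) -> str:
    """Format a Turn event as 'Speaker X: transcript'."""
    # speaker_label may be null early in a session; fall back to '?'
    speaker = event.get("speaker_label") or "?"
    return f"Speaker {speaker}: {event.get('transcript', '')}"

# In a real session you would open a websocket to build_connection_url(),
# stream microphone PCM audio into it, and call handle_turn() on each
# incoming Turn message, e.g. with the `websockets` library:
#
#   async with websockets.connect(build_connection_url()) as ws:
#       ...
```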
Configuration
Enable Streaming Diarization by adding speaker_labels: true to your connection
parameters. You can optionally cap the number of speakers with max_speakers.
Diarization is supported on u3-rt-pro, universal-streaming-english, and
universal-streaming-multilingual. You do not need to change your speech
model to use it — just add speaker_labels: true.
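As a sketch, a diarized session's connection parameters might look like the following. The parameter names come from this page; the sample rate and speaker cap are illustrative values, not defaults.

```python
# Connection parameters for a diarized streaming session.
# Parameter names follow this page; the values shown are illustrative.
connection_params = {
    "sample_rate": 16000,     # PCM sample rate of your audio source
    "speaker_labels": True,   # enable Streaming Diarization
    "max_speakers": 4,        # optional: cap the number of speaker labels
}
```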
Reading speaker labels
When diarization is enabled, every Turn event includes a speaker_label field
with a label such as A, B, and so on.
If the model cannot confidently assign a speaker — typically for very short
utterances at the start of a session — the speaker_label field may be null.
Your application should handle this case gracefully.
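One way to handle the null case is a small accessor that substitutes a placeholder, as in this sketch (the Turn event is treated as a plain dict, which is an assumption about the wire format):

```python
def label_for(turn: dict) -> str:
    """Return the speaker label for a Turn event, tolerating null labels."""
    label = turn.get("speaker_label")
    # Short utterances early in a session may arrive with a null
    # speaker_label; substitute a placeholder rather than crashing
    # or silently attributing the turn to the wrong speaker.
    return label if label is not None else "unknown"
```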
A typical multi-speaker exchange looks like this:
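(The exchange below is illustrative; actual transcripts and label assignments will differ.)

```
Speaker A: Thanks for calling, how can I help you today?
Speaker B: Hi, I'd like to check on my order status.
Speaker A: Sure, can I get your order number?
```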
How speaker accuracy improves over time
Streaming Diarization builds a speaker profile incrementally as audio flows in. In practice this means:
- Early in a session, speaker assignments may be less stable, especially if the first few turns are short.
- As the session progresses, the model accumulates richer speaker embeddings and assignments become more consistent.
For long-form use cases (call centers, clinical scribes, meeting transcription), labels typically become accurate and stable early in the conversation and remain so for its duration.
Known limitations
Real-time diarization is an inherently harder problem than diarization for async transcription on pre-recorded audio. The following limitations apply to the current beta:
- Short utterances — Turns shorter than ~3 words provide insufficient audio for a reliable speaker embedding. Single-word responses like “yes” or “no” may receive a low-confidence or incorrect label.
- Overlapping speech — When two speakers talk simultaneously, the model cannot split the audio and will assign the turn to a single speaker. Performance degrades with frequent cross-talk.
- Session start accuracy — The first 1–2 turns of a session may be misassigned because the model has not yet built up speaker profiles. This self-corrects quickly in practice.
- Noisy environments — Background noise and microphone bleed between speakers can reduce embedding quality and lead to more frequent misassignments.
For the best results, use a microphone setup that minimizes cross-talk and background noise, and ensure each speaker produces at least a few complete sentences before you rely on per-turn labels for downstream processing.
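The guidance above can be sketched as a simple gate applied before downstream processing. The word threshold and turn-index cutoff below are illustrative heuristics drawn from the limitations listed on this page, not parameters of the API:

```python
MIN_WORDS = 3  # turns shorter than ~3 words yield weak speaker embeddings

def is_label_reliable(turn: dict, turn_index: int) -> bool:
    """Heuristic gate: trust a turn's speaker label only when it is
    present, long enough, and not among the first turns of the session."""
    label = turn.get("speaker_label")
    words = turn.get("transcript", "").split()
    return (
        label is not None
        and len(words) >= MIN_WORDS
        and turn_index >= 2  # the first 1-2 turns may be misassigned
    )
```

A pipeline might buffer or discard turns that fail this check instead of routing them by speaker.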