What is speaker diarization and how does it work? (Complete 2026 Guide)
With a recent market survey revealing that 76% of companies embed conversation intelligence in more than half of their customer interactions, understanding its core components is more important than ever. In this blog post, we'll take a closer look at how one of those components, speaker diarization, works, why it's useful, some of its current limitations, and how to easily use it on audio/video files.
What is speaker diarization?
Speaker diarization is an AI process that identifies who spoke when in audio recordings by automatically detecting multiple speakers and assigning speech segments to the correct speaker labels.
In Automatic Speech Recognition (ASR), this involves two key functions:
- Speaker Detection: Identifying the number of distinct speakers in an audio file.
- Speaker Attribution: Assigning each segment of speech to the correct speaker.
The result is a transcript where each segment of speech is tagged with a speaker label (e.g., "Speaker A," "Speaker B"), making it easy to distinguish between different voices.
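In practice, a diarized result is often represented as a list of labeled segments. Here's an illustrative sketch of that shape in JavaScript (the field names and values are made up for this example, not any specific API's output):

// Illustrative shape of a diarized transcript: each segment carries a speaker
// label plus timing information, so downstream code can group dialogue by voice.
const diarizedTranscript = [
  { speaker: "A", start: 0.5, end: 3.2, text: "Hi, thanks for joining the call." },
  { speaker: "B", start: 3.4, end: 5.1, text: "Happy to be here." },
  { speaker: "A", start: 5.3, end: 7.0, text: "Great. Let's get started." },
];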
How does speaker diarization work?
Speaker diarization applies speaker labels ("Speaker A," "Speaker B") to each transcribed word. Modern AI models execute this through four core steps:
- Step 1: Audio Segmentation
- Step 2: Speaker Embedding Generation
- Step 3: Speaker Count Estimation
- Step 4: Clustering and Assignment
Step 1: Audio segmentation
Audio segmentation divides recordings into utterances of 0.5-10 seconds each. AI models need sufficient audio data to identify speakers accurately.
Example utterances:
- Utterance 1: "Hello my name is Cindy"
- Utterance 2: "I like dogs and live in San Francisco"
Each utterance gets assigned to a speaker label during the clustering process.
There are many ways to break an audio/video file up into a set of utterances, with one common approach being to split on silence and punctuation markers. In our research, we start to see a drop-off in the model's ability to correctly assign an utterance to a speaker when utterances are shorter than one second.
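To make this concrete, here's a minimal sketch of silence-based segmentation over raw audio. It assumes you already have mono PCM samples in a Float32Array; the window size, energy threshold, and minimum durations are illustrative values, not production settings:

// Minimal sketch: split raw audio into utterances wherever a long-enough
// silence occurs. Assumes `samples` is mono PCM audio at `sampleRate` Hz.
function segmentBySilence(samples, sampleRate, options = {}) {
  const {
    windowMs = 30,          // analysis window length
    energyThreshold = 0.01, // RMS below this counts as silence (illustrative)
    minSilenceMs = 300,     // pause length that ends an utterance
    minUtteranceMs = 1000,  // drop segments under ~1s, per the accuracy note above
  } = options;

  const windowSize = Math.floor((sampleRate * windowMs) / 1000);
  const silenceWindows = Math.ceil(minSilenceMs / windowMs);
  const minSamples = (sampleRate * minUtteranceMs) / 1000;

  const segments = [];
  let segStart = null;
  let silentRun = 0;

  for (let i = 0; i + windowSize <= samples.length; i += windowSize) {
    // Root-mean-square energy of this window
    let sum = 0;
    for (let j = i; j < i + windowSize; j++) sum += samples[j] * samples[j];
    const rms = Math.sqrt(sum / windowSize);

    if (rms >= energyThreshold) {
      if (segStart === null) segStart = i; // speech begins
      silentRun = 0;
    } else if (segStart !== null && ++silentRun >= silenceWindows) {
      // Enough consecutive silence: close out the current utterance
      const segEnd = i - (silenceWindows - 1) * windowSize;
      if (segEnd - segStart >= minSamples) {
        segments.push({ start: segStart / sampleRate, end: segEnd / sampleRate });
      }
      segStart = null;
      silentRun = 0;
    }
  }
  if (segStart !== null && samples.length - segStart >= minSamples) {
    segments.push({ start: segStart / sampleRate, end: samples.length / sampleRate });
  }
  return segments; // e.g. [{ start: 0.0, end: 4.2 }, { start: 4.8, end: 9.1 }]
}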
Step 2: Speaker embedding generation
Each utterance passes through an AI model that generates speaker embeddings. These embeddings are high-dimensional numerical representations of unique speaker characteristics.
The visualization below shows how embeddings capture speaker features:

We use a similar process to convert segments of audio, rather than words, into embeddings.
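Because an embedding is just a vector of numbers, asking "do these two utterances sound like the same person?" becomes a vector-similarity question. Here's a minimal sketch using cosine similarity, a common comparison in diarization pipelines (the example vectors are made up; real speaker embeddings have hundreds of dimensions):

// Sketch: speaker embeddings are numeric vectors, so "same speaker?" becomes
// "how similar are the vectors?". Cosine similarity is a common way to measure this.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB)); // ~1 means very similar voices
}

// Made-up three-dimensional examples for illustration only.
const utteranceA = [0.12, -0.48, 0.33];
const utteranceB = [0.10, -0.45, 0.35];
console.log(cosineSimilarity(utteranceA, utteranceB)); // high score → likely the same speaker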
Step 3: Speaker count estimation
Next, we need to decide how many speakers are present in the audio file, which is a key feature of a modern speaker diarization model. Legacy diarization systems required knowing the number of speakers in an audio/video file ahead of time, but a major benefit of modern models is that they can accurately predict this number.
Our first goal here is to overestimate the number of speakers: through clustering methods, we estimate the highest number of speakers that is reasonably possible. Why overestimate? It's much easier to merge utterances that the model has split across different speaker labels than it is to disentangle two speakers that have been collapsed into one.
After this initial step, we go back and combine or separate speakers as needed to arrive at an accurate count.
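Here's a simplified sketch of that combine step, assuming the embeddings have already been over-clustered and averaged into one centroid vector per cluster. The greedy merge and the 0.85 similarity threshold are illustrative simplifications, not how any particular production system works (it reuses the cosineSimilarity() helper from the previous step):

// Sketch: after deliberately over-estimating the speaker count, merge any
// clusters whose centroids are so similar they likely belong to one speaker.
function mergeSimilarClusters(centroids, threshold = 0.85) {
  const merged = [];
  for (const centroid of centroids) {
    // Greedy pass: fold this cluster into the first close-enough match
    const match = merged.find(
      (m) => cosineSimilarity(m.centroid, centroid) >= threshold
    );
    if (match) {
      match.members.push(centroid); // probably the same speaker
    } else {
      merged.push({ centroid, members: [centroid] }); // a genuinely new speaker
    }
  }
  return merged; // merged.length is the refined speaker-count estimate
}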
Step 4: Clustering and assignment
Finally, speaker diarization models take the embeddings produced above and cluster them into as many clusters as there are speakers. For example, if a diarization model predicts there are four speakers in an audio file, the embeddings will be forced into four groups based on the similarity of the embeddings.
In the image below, let's assume each dot is an utterance. The utterances get clustered together based on their similarity, with the idea being that each cluster represents a unique speaker.

There are many ways to measure the similarity of embeddings, and this is a core component of accurately predicting speaker labels with a speaker diarization model. Recent architectural advances in speaker embedding models have improved clustering accuracy, particularly for short utterances and challenging acoustic conditions.
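To round out the picture, here's a minimal sketch of the final assignment: each utterance's embedding is matched to its most similar cluster centroid, and the cluster index becomes the speaker label. As before, this reuses the hypothetical cosineSimilarity() helper and assumes the centroids come from the clustering step:

// Sketch: label each utterance with its most similar speaker cluster.
// `utterances` each have an .embedding and .text; `centroids` come from clustering.
function assignSpeakers(utterances, centroids) {
  const labels = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; // one letter per speaker
  return utterances.map((utt) => {
    let best = 0;
    for (let i = 1; i < centroids.length; i++) {
      if (
        cosineSimilarity(utt.embedding, centroids[i]) >
        cosineSimilarity(utt.embedding, centroids[best])
      ) {
        best = i;
      }
    }
    return { speaker: `Speaker ${labels[best]}`, text: utt.text };
  });
}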
After this step, you now have a transcription complete with accurate speaker labels!
Today's models can identify up to 26 speakers in the same audio/video file with high accuracy.
How to implement speaker diarization with AssemblyAI
Implementation requires one API parameter: speaker_labels: true
Basic setup:
- Enable speaker_labels: true in your request
- Upload an audio file or provide a URL
- Receive a transcript with speaker labels per word
Here's a complete JavaScript (Node.js) example that shows how to transcribe a local audio file and get back a diarized transcript:
import axios from "axios";
import fs from "fs-extra";

const baseUrl = "https://api.assemblyai.com";
const headers = {
  authorization: "<YOUR_API_KEY>",
};

// Upload a local file so the API can access it
const path = "./audio/audio.mp3";
const audioData = await fs.readFile(path);
const uploadResponse = await axios.post(`${baseUrl}/v2/upload`, audioData, {
  headers,
});
const uploadUrl = uploadResponse.data.upload_url;

// Request a transcript with speaker labels enabled
const data = {
  audio_url: uploadUrl, // You can also use a URL to an audio or video file on the web
  speaker_labels: true,
  speakers_expected: 5, // Optional hint when you know the speaker count
};

const url = `${baseUrl}/v2/transcript`;
const response = await axios.post(url, data, { headers });
const transcriptId = response.data.id;
const pollingEndpoint = `${baseUrl}/v2/transcript/${transcriptId}`;

// Poll until the transcript is ready, then print each utterance with its speaker
while (true) {
  const pollingResponse = await axios.get(pollingEndpoint, { headers });
  const transcriptionResult = pollingResponse.data;

  if (transcriptionResult.status === "completed") {
    for (const utterance of transcriptionResult.utterances) {
      console.log(`Speaker ${utterance.speaker}: ${utterance.text}`);
    }
    break;
  } else if (transcriptionResult.status === "error") {
    throw new Error(`Transcription failed: ${transcriptionResult.error}`);
  } else {
    await new Promise((resolve) => setTimeout(resolve, 3000));
  }
}

The final output will be a clean, readable transcript where each line of dialogue is clearly attributed to the correct speaker, making the voice data immediately more useful.
Why is speaker diarization useful?
Speaker diarization is useful because it turns a big wall of text into something far more meaningful and valuable. If you were to read a transcription without speaker labels, your brain would have to assign each word or sentence to the appropriate speaker on its own. Speaker labels do that work for you, saving time and mental energy.
For example, let's look at the before and after transcripts below with and without speaker diarization:
Without:
But how did you guys first meet and how do you guys know each other? I actually met her not too long ago. I met her, I think last year in December, during pre season, we were both practicing at Carson a lot. And then we kind of met through other players. And then I saw her a few her last few torments this year, and we would just practice together sometimes, and she's really, really nice. I obviously already knew who she was because she was so good. Right. So. And I looked up to and I met her. I already knew who she was, but that was cool for me. And then I watch her play her last few events, and then I'm actually doing an exhibition for her charity next month. I think super cool. Yeah. I'm excited to be a part of that. Yeah. Well, we'll definitely highly promote that. Vania and I are both together on the Diversity and Inclusion committee for the USDA, so I'm sure she'll tell me all about that. And we're really excited to have you as a part of that tournament. So thank you so much. And you have had an exciting year so far. My goodness. Within your first WTI 1000 doubles tournament, the Italian Open.Congrats to that. That's huge. Thank you.
With:
Speaker A: But how did you guys first meet and how do you guys know each other?
Speaker B: I actually met her not too long ago. I met her, I think last year in December, during pre season, we were both practicing at Carson a lot. And then we kind of met through other players. And then I saw her a few her last few torments this year, and we would just practice together sometimes, and she's really, really nice. I obviously already knew who she was because she was so good.
Speaker A: Right. So.
Speaker B: And I looked up to and I met her. I already knew who she was, but that was cool for me. And then I watch her play her last few events, and then I'm actually doing an exhibition for her charity next month.
Speaker A: I think super cool.
Speaker B: Yeah. I'm excited to be a part of that.
Speaker A: Yeah. Well, we'll definitely highly promote that. Vania and I are both together on the Diversity and Inclusion committee for the USDA. So I'm sure she'll tell me all about that. And we're really excited to have you as a part of that tournament. So thank you so much. And you have had an exciting year so far. My goodness. Within your first WTI 1000 doubles tournament, the Italian Open. Congrats to that. That's huge.
Speaker B: Thank you.
See how much easier the transcription is to read with speaker diarization?
It is also a powerful analytic tool. In fact, industry research shows that analytics and intelligence are now the most common use cases for conversation intelligence. When you identify and label speakers, you can analyze each speaker's behaviors, identify patterns/trends among individual speakers, make predictions, and more. For example:
- A call center might analyze agent messages versus customer requests or complaints to identify trends that could help facilitate better communication.
- A podcast service might use speaker labels to identify the host and guest, making transcriptions more readable for end users.
- A telemedicine platform might identify the doctor and patient to create an accurate transcript, attach a readable transcript to patient files, or feed the transcript into an EHR system.