Speaker Diarization

The Speaker Diarization model lets you detect multiple speakers in an audio file and determine what each speaker said.

If you enable Speaker Diarization, the resulting transcript includes a list of utterances, where each utterance corresponds to an uninterrupted segment of speech from a single speaker.

Want to name your speakers?

Speaker Diarization assigns generic labels like “Speaker A” and “Speaker B” to distinguish between speakers. If you want to replace these labels with actual names or roles (e.g., “John Smith” or “Customer”), use Speaker Identification. Speaker Identification analyzes the conversation content to infer who is speaking and transforms your transcript from generic labels to meaningful identifiers.

Quickstart

To enable Speaker Diarization, set speaker_labels to True in the transcription config.

import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"

# You can use a local filepath:
# audio_file = "./example.mp3"

# Or use a publicly-accessible URL:
audio_file = (
    "https://assembly.ai/wildfires.mp3"
)

config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal-2"],
    language_detection=True,
    speaker_labels=True,
)

transcript = aai.Transcriber().transcribe(audio_file, config)

for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
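
Each utterance also carries start and end timestamps in milliseconds. As a minimal extension of the loop above, you can print each utterance with its timing; the format_ms helper below is illustrative, not part of the SDK:

def format_ms(ms: int) -> str:
    # Convert milliseconds to an H:MM:SS string.
    seconds = ms // 1000
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"

for utterance in transcript.utterances:
    print(f"[{format_ms(utterance.start)} - {format_ms(utterance.end)}] "
          f"Speaker {utterance.speaker}: {utterance.text}")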

Set number of speakers expected

You can set the number of speakers expected in the audio file by setting the speakers_expected parameter.

Only use this parameter if you are certain about the number of speakers in the audio file.

import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"

# You can use a local filepath:
# audio_file = "./example.mp3"

# Or use a publicly-accessible URL:
audio_file = (
    "https://assembly.ai/wildfires.mp3"
)

config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal-2"],
    language_detection=True,
    speaker_labels=True,
    speakers_expected=5,
)

transcript = aai.Transcriber().transcribe(audio_file, config)

for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")

Set a range of possible speakers

You can set a range of possible speakers in the audio file by setting the speaker_options parameter. By default, the model will return between 1 and 10 speakers.

Use this parameter when the known minimum or maximum number of speakers in the audio file falls outside the default range of 1 to 10.

Setting max_speakers_expected too high may reduce diarization accuracy, causing sentences from the same speaker to be split across multiple speaker labels.

import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"

# You can use a local filepath:
# audio_file = "./example.mp3"

# Or use a publicly-accessible URL:
audio_file = (
    "https://assembly.ai/wildfires.mp3"
)

config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro", "universal-2"],
    language_detection=True,
    speaker_labels=True,
    speaker_options=aai.SpeakerOptions(
        min_speakers_expected=3,
        max_speakers_expected=5,
    ),
)

transcript = aai.Transcriber().transcribe(audio_file, config)

for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
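
After transcription, a quick sanity check (a few lines over the result, not an API feature) is to count how many distinct speakers were actually detected and confirm the number falls within the range you configured:

detected = {utterance.speaker for utterance in transcript.utterances}
print(f"Detected {len(detected)} distinct speakers: {sorted(detected)}")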

API reference

Request

Speakers Expected

curl https://api.assemblyai.com/v2/transcript \
  --header "Authorization: <YOUR_API_KEY>" \
  --header "Content-Type: application/json" \
  --data '{
    "audio_url": "YOUR_AUDIO_URL",
    "speech_model": "universal-3-pro",
    "language_detection": true,
    "speaker_labels": true,
    "speakers_expected": 3
  }'

Speaker Options

curl https://api.assemblyai.com/v2/transcript \
  --header "Authorization: <YOUR_API_KEY>" \
  --header "Content-Type: application/json" \
  --data '{
    "audio_url": "YOUR_AUDIO_URL",
    "speech_model": "universal-3-pro",
    "language_detection": true,
    "speaker_labels": true,
    "speaker_options": {
      "min_speakers_expected": 3,
      "max_speakers_expected": 5
    }
  }'

| Key | Type | Description |
| --- | --- | --- |
| speaker_labels | boolean | Enable Speaker Diarization. |
| speakers_expected | number | Set the number of speakers expected. |
| speaker_options | object | Set a range of possible speakers. |
| speaker_options.min_speakers_expected | number | The minimum number of speakers expected in the audio file. |
| speaker_options.max_speakers_expected | number | The maximum number of speakers expected in the audio file. |

Response

| Key | Type | Description |
| --- | --- | --- |
| utterances | array | A turn-by-turn temporal sequence of the transcript, where the i-th element is an object containing information about the i-th utterance in the audio file. |
| utterances[i].confidence | number | A score between 0 and 1 indicating the model's confidence in the accuracy of the transcribed text for this utterance. |
| utterances[i].end | number | The ending time, in milliseconds, of the utterance in the audio file. |
| utterances[i].speaker | string | The speaker of this utterance, where each speaker is assigned a sequential capital letter. For example, "A" for Speaker A, "B" for Speaker B, and so on. |
| utterances[i].start | number | The starting time, in milliseconds, of the utterance in the audio file. |
| utterances[i].text | string | The transcript for this utterance. |
| utterances[i].words | array | A sequential array of the words in the utterance, where the j-th element is an object containing information about the j-th word. |
| utterances[i].words[j].text | string | The text of the j-th word in the i-th utterance. |
| utterances[i].words[j].start | number | The starting time, in milliseconds, for when the j-th word is spoken in the i-th utterance. |
| utterances[i].words[j].end | number | The ending time, in milliseconds, for when the j-th word is spoken in the i-th utterance. |
| utterances[i].words[j].confidence | number | The confidence score for the transcript of the j-th word in the i-th utterance. |
| utterances[i].words[j].speaker | string | The speaker who uttered the j-th word in the i-th utterance. |

The response also includes the request parameters used to generate the transcript.
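
Since each word object carries its own speaker label, you can also work below the utterance level. The sketch below assumes a completed transcript JSON (fetched as shown in the polling example later in this guide) and simply tallies words per speaker:

from collections import Counter

# Tally how many words each speaker spoke, using word-level speaker labels.
word_counts = Counter(
    word["speaker"]
    for utterance in transcript["utterances"]
    for word in utterance["words"]
)

for speaker, count in word_counts.items():
    print(f"Speaker {speaker}: {count} words")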

Identify speakers by name

Speaker Diarization assigns generic labels like “Speaker A” and “Speaker B” to each speaker. If you want to replace these labels with actual names or roles, you can use Speaker Identification to transform your transcript.

Before Speaker Identification:

Speaker A: Good morning, and welcome to the show.
Speaker B: Thanks for having me.

After Speaker Identification:

Michel Martin: Good morning, and welcome to the show.
Peter DeCarlo: Thanks for having me.

The following example shows how to transcribe audio with Speaker Diarization and then apply Speaker Identification to replace the generic speaker labels with actual names.

import requests
import time

base_url = "https://api.assemblyai.com"

headers = {
    "authorization": "<YOUR_API_KEY>"
}

audio_url = "https://assembly.ai/wildfires.mp3"

# Configure transcript with speaker diarization and speaker identification
data = {
    "audio_url": audio_url,
    "speech_models": ["universal-3-pro", "universal-2"],
    "language_detection": True,
    "speaker_labels": True,
    "speech_understanding": {
        "request": {
            "speaker_identification": {
                "speaker_type": "name",
                "known_values": ["Michel Martin", "Peter DeCarlo"]
            }
        }
    }
}

# Submit the transcription request
response = requests.post(base_url + "/v2/transcript", headers=headers, json=data)
transcript_id = response.json()["id"]
polling_endpoint = base_url + f"/v2/transcript/{transcript_id}"

# Poll for transcription results
while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()

    if transcript["status"] == "completed":
        break
    elif transcript["status"] == "error":
        raise RuntimeError(f"Transcription failed: {transcript['error']}")
    else:
        time.sleep(3)

# Print utterances with identified speaker names
for utterance in transcript["utterances"]:
    print(f"{utterance['speaker']}: {utterance['text']}")

For more details on Speaker Identification, including how to identify speakers by role and how to apply it to existing transcripts, see the Speaker Identification guide.
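
If you already know which generic label corresponds to which person, you can also rename speakers yourself as a lightweight post-processing step, without calling Speaker Identification. This sketch assumes a diarization-only transcript whose speaker fields are still generic letters; the mapping is purely illustrative:

# Hypothetical mapping from generic diarization labels to known names.
label_map = {"A": "Michel Martin", "B": "Peter DeCarlo"}

for utterance in transcript["utterances"]:
    name = label_map.get(utterance["speaker"], f"Speaker {utterance['speaker']}")
    print(f"{name}: {utterance['text']}")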

Frequently asked questions & troubleshooting

How can I improve the performance of the Speaker Diarization model?

To improve the performance of the Speaker Diarization model, it's recommended that each speaker speak uninterrupted for at least 30 seconds. Avoiding scenarios where a person speaks only a few short phrases like "Yeah", "Right", or "Sounds good" can also help, as can minimizing cross-talk where possible.

How many speakers can the model handle?

By default, the upper limit on the number of speakers for Speaker Diarization is 10. If you expect more than 10 speakers, you can use speaker_options to set a range of possible speakers. Note that setting max_speakers_expected too high may reduce diarization accuracy, causing sentences from the same speaker to be split across multiple speaker labels.

How accurate is the Speaker Diarization model?

The accuracy of the Speaker Diarization model depends on several factors, including the quality of the audio, the number of speakers, and the length of the audio file. Ensuring that each speaker speaks for at least 30 seconds uninterrupted and avoiding scenarios where a person only speaks a few short phrases can improve accuracy. However, the model isn't perfect and may make mistakes, especially in more challenging scenarios.

Why is Speaker Diarization performing poorly?

Speaker Diarization may perform poorly if a speaker speaks only once or infrequently throughout the audio file. The model may also struggle to create separate clusters for each speaker when a speaker talks in short or single-word utterances. Speakers who sound similar can likewise be difficult to identify and separate accurately, and background noise, cross-talk, or echo may cause further issues.

Should I use speakers_expected or speaker_options?

speakers_expected should be used only when you are confident that your audio file contains exactly the number of speakers you specify. If this number is incorrect, the diarization process is forced to find the wrong number of speakers and may arbitrarily split single-speaker segments or merge multiple speakers into one in order to return the specified count. There are various scenarios where the audio file may include unexpected speakers, such as playback of recorded audio during a conversation or background speech from other people. To account for such cases, it is generally recommended to use min_speakers_expected instead of speakers_expected, and to set max_speakers_expected slightly higher (e.g., min_speakers_expected + 2) to allow some flexibility.
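
As a concrete sketch of that recommendation: if you expect at least three speakers, configure a range rather than an exact count. The values below are illustrative:

import assemblyai as aai

# Prefer a range over an exact count: at least 3 speakers, with headroom
# (min + 2) for unexpected voices such as background speech or playback
# of recorded audio.
config = aai.TranscriptionConfig(
    speaker_labels=True,
    speaker_options=aai.SpeakerOptions(
        min_speakers_expected=3,
        max_speakers_expected=5,  # min_speakers_expected + 2
    ),
)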