Speaker Diarization

The Speaker Diarization model lets you detect multiple speakers in an audio file and identify what each speaker said.

If you enable Speaker Diarization, the resulting transcript includes a list of utterances, where each utterance corresponds to an uninterrupted segment of speech from a single speaker.

Speaker Diarization and multichannel

Speaker Diarization doesn’t support multichannel transcription. Enabling both Speaker Diarization and multichannel will result in an error.
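For example, enabling both features in the same request fails. The sketch below assumes the Python SDK's multichannel configuration option, which isn't shown elsewhere on this page:

import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"

# Enabling both options together is invalid.
config = aai.TranscriptionConfig(
    speaker_labels=True,   # Speaker Diarization
    multichannel=True,     # Multichannel transcription (assumed option name)
)

transcript = aai.Transcriber().transcribe("./example.mp3", config)

# The request is rejected, so check the error instead of reading utterances.
if transcript.status == aai.TranscriptStatus.error:
    print(transcript.error)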

Quickstart

To enable Speaker Diarization, set speaker_labels to True in the transcription config.

import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"

# You can use a local filepath:
# audio_file = "./example.mp3"

# Or use a publicly-accessible URL:
audio_file = "https://assembly.ai/wildfires.mp3"

config = aai.TranscriptionConfig(
    speaker_labels=True,
)

transcript = aai.Transcriber().transcribe(audio_file, config)

for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
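The script prints one line per utterance. For a two-speaker recording, the output looks something like this (illustrative only, not the actual transcript of the example file):

Speaker A: Smoke from hundreds of wildfires in Canada is triggering air quality alerts across the US.
Speaker B: Well, there's a lot of factors that play into air quality.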

API reference

Request

curl https://api.assemblyai.com/v2/transcript \
  --header "Authorization: <YOUR_API_KEY>" \
  --header "Content-Type: application/json" \
  --data '{
    "audio_url": "YOUR_AUDIO_URL",
    "speaker_labels": true,
    "speakers_expected": 3
  }'
speaker_labels (boolean): Enable Speaker Diarization.
speakers_expected (number): The expected number of speakers in the audio file.
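The same hint is available from the Python SDK. A minimal sketch, reusing the quickstart setup from above:

import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"

# Tell the model how many speakers to expect (here, three).
config = aai.TranscriptionConfig(
    speaker_labels=True,
    speakers_expected=3,
)

transcript = aai.Transcriber().transcribe("https://assembly.ai/wildfires.mp3", config)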

Response

utterances (array): A turn-by-turn temporal sequence of the transcript, where the i-th element is an object containing information about the i-th utterance in the audio file.
utterances[i].confidence (number): The confidence score for the transcript of this utterance.
utterances[i].end (number): The ending time, in milliseconds, of the utterance in the audio file.
utterances[i].speaker (string): The speaker of this utterance, where each speaker is assigned a sequential capital letter. For example, “A” for Speaker A, “B” for Speaker B, and so on.
utterances[i].start (number): The starting time, in milliseconds, of the utterance in the audio file.
utterances[i].text (string): The transcript for this utterance.
utterances[i].words (array): A sequential array for the words in the transcript, where the j-th element is an object containing information about the j-th word in the utterance.
utterances[i].words[j].text (string): The text of the j-th word in the i-th utterance.
utterances[i].words[j].start (number): The starting time for when the j-th word is spoken in the i-th utterance, in milliseconds.
utterances[i].words[j].end (number): The ending time for when the j-th word is spoken in the i-th utterance, in milliseconds.
utterances[i].words[j].confidence (number): The confidence score for the transcript of the j-th word in the i-th utterance.
utterances[i].words[j].speaker (string): The speaker who uttered the j-th word in the i-th utterance.
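Pieced together from the fields above, a response containing a single short utterance has roughly this shape (all values are illustrative):

{
  "utterances": [
    {
      "confidence": 0.94,
      "start": 250,
      "end": 2150,
      "speaker": "A",
      "text": "Good morning, everyone.",
      "words": [
        { "text": "Good", "start": 250, "end": 650, "confidence": 0.97, "speaker": "A" },
        { "text": "morning,", "start": 650, "end": 1100, "confidence": 0.95, "speaker": "A" },
        { "text": "everyone.", "start": 1100, "end": 2150, "confidence": 0.93, "speaker": "A" }
      ]
    }
  ]
}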

The response also includes the request parameters used to generate the transcript.

Frequently asked questions & troubleshooting

To improve the performance of the Speaker Diarization model, make sure each speaker speaks for at least 30 seconds uninterrupted. Avoiding scenarios where a person only says a few short phrases like “Yeah”, “Right”, or “Sounds good” also helps, as does minimizing cross-talk where possible.

The upper limit on the number of speakers for Speaker Diarization is 10.

The accuracy of the Speaker Diarization model depends on several factors, including the quality of the audio, the number of speakers, and the length of the audio file. Ensuring that each speaker speaks for at least 30 seconds uninterrupted and avoiding scenarios where a person only says a few short phrases can improve accuracy. That said, the model isn’t perfect and may make mistakes, especially in more challenging scenarios.

Speaker Diarization may perform poorly if a speaker speaks only once or infrequently throughout the audio file. The model may also struggle to create separate clusters for each speaker when speakers talk in short or single-word utterances, and it can have difficulty identifying and separating speakers who sound similar. Background noise, cross-talk, or echo may also cause issues.
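One way to triage a suspect transcript is to flag utterances that are very short or low-confidence, since those segments are the most likely to carry a wrong speaker label. A minimal sketch using the fields documented above; the 0.5-second and 0.7 thresholds are arbitrary illustrative choices, not recommendations from AssemblyAI:

import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"

config = aai.TranscriptionConfig(speaker_labels=True)
transcript = aai.Transcriber().transcribe("./example.mp3", config)

# Flag utterances that are very short or low-confidence for manual review.
for u in transcript.utterances:
    duration_s = (u.end - u.start) / 1000  # start/end are in milliseconds
    if duration_s < 0.5 or u.confidence < 0.7:
        print(f"Review: Speaker {u.speaker} "
              f"({duration_s:.1f}s, confidence {u.confidence:.2f}): {u.text}")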