Identifying speakers in audio recordings

When you apply the Speaker Diarization model, the transcript contains not only the text but also a speaker label for each utterance, so you can tell who said what.

In this step-by-step guide, you’ll learn how to apply the model. In short, you send the speaker_labels parameter in your request and then read the results from a field called utterances in the response.

Get started

Before we begin, make sure you have an AssemblyAI account and an API key. You can sign up for a free account and get your API key from your dashboard.

The complete source code for this guide can be viewed here.

Here is an audio example for this guide:

https://assembly.ai/wildfires.mp3

Step-by-step instructions

1. Install the requests library.

    pip install requests

2. Import the requests and time libraries, then set the API base URL and your API key.

    import requests
    import time

    base_url = "https://api.assemblyai.com"

    headers = {
        "authorization": "<YOUR_API_KEY>"
    }

3. Upload your local audio file to the API and store the returned upload URL.

    with open("./my-audio.mp3", "rb") as f:
        response = requests.post(base_url + "/v2/upload",
                                 headers=headers,
                                 data=f)

    upload_url = response.json()["upload_url"]

4. Create the request body with the uploaded audio URL and speaker_labels set to True.

    data = {
        "audio_url": upload_url,
        "speaker_labels": True
    }

5. Submit the transcription request by sending the request body to the /v2/transcript endpoint.

    url = base_url + "/v2/transcript"
    response = requests.post(url, json=data, headers=headers)

6. Poll the transcript endpoint until the transcription is complete. You can then access the speaker label results through the utterances key of the response.

    transcript_id = response.json()['id']
    polling_endpoint = f"{base_url}/v2/transcript/{transcript_id}"

    while True:
        transcription_result = requests.get(polling_endpoint, headers=headers).json()

        if transcription_result['status'] == 'completed':
            # When the transcript is complete, extract all utterances from the response
            transcript_text = transcription_result['text']
            utterances = transcription_result['utterances']

            # For each utterance, print its speaker and what was said
            for utterance in utterances:
                speaker = utterance['speaker']
                text = utterance['text']
                print(f"Speaker {speaker}: {text}")

            break

        elif transcription_result['status'] == 'error':
            raise RuntimeError(f"Transcription failed: {transcription_result['error']}")

        else:
            # Still queued or processing: wait before polling again
            time.sleep(3)
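
The polling loop above keeps running until the API reports a completed or error status. As a minimal sketch, the same logic could be wrapped in a reusable function with an upper bound on the number of attempts. The names poll_until_done, fetch_status, interval_seconds, and max_attempts are hypothetical and not part of the AssemblyAI API; fetch_status stands in for any callable that returns the decoded JSON of the polling request.

```python
import time
from typing import Any, Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    interval_seconds: float = 3.0,
    max_attempts: int = 200,
) -> Dict[str, Any]:
    """Call fetch_status until the transcript has completed or failed.

    All names here are illustrative, not part of the AssemblyAI API.
    """
    for _ in range(max_attempts):
        result = fetch_status()
        if result["status"] == "completed":
            return result
        if result["status"] == "error":
            raise RuntimeError(f"Transcription failed: {result['error']}")
        # Still queued or processing: wait before asking again
        time.sleep(interval_seconds)
    raise TimeoutError(f"Transcript not ready after {max_attempts} attempts")
```

With the variables from the steps above, this could be called as poll_until_done(lambda: requests.get(polling_endpoint, headers=headers).json()). Capping the number of attempts is a design choice that keeps a script from waiting forever if a job never leaves the queue.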

Understanding the response

The speaker label information is included in the utterances key of the response. Each utterance object in the list includes a speaker field, which contains a string identifier for the speaker (e.g., “A”, “B”). Each utterance also includes a text field with the spoken text, plus confidence scores for the utterance itself and for each of its individual words.
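
To make the structure concrete, here is a small sketch that works on a hand-written sample rather than real API output. It assumes each utterance carries the speaker, text, and confidence fields described above, and that the per-word scores live in a list under a words key; the sample values and the helper names group_text_by_speaker and low_confidence_words are invented for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hand-written sample shaped like the fields described above.
# The values are invented for illustration, not real API output.
sample_utterances: List[Dict[str, Any]] = [
    {
        "speaker": "A",
        "text": "Good morning.",
        "confidence": 0.95,
        "words": [
            {"text": "Good", "confidence": 0.97},
            {"text": "morning.", "confidence": 0.93},
        ],
    },
    {
        "speaker": "B",
        "text": "Hello.",
        "confidence": 0.71,
        "words": [{"text": "Hello.", "confidence": 0.71}],
    },
    {
        "speaker": "A",
        "text": "How are you?",
        "confidence": 0.95,
        "words": [
            {"text": "How", "confidence": 0.99},
            {"text": "are", "confidence": 0.98},
            {"text": "you?", "confidence": 0.88},
        ],
    },
]


def group_text_by_speaker(utterances: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Collect the text of every utterance under its speaker label."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for utterance in utterances:
        grouped[utterance["speaker"]].append(utterance["text"])
    return dict(grouped)


def low_confidence_words(utterances: List[Dict[str, Any]], threshold: float) -> List[str]:
    """Return the words whose confidence score is below the threshold."""
    return [
        word["text"]
        for utterance in utterances
        for word in utterance["words"]
        if word["confidence"] < threshold
    ]


for speaker, lines in group_text_by_speaker(sample_utterances).items():
    print(f"Speaker {speaker}: {' '.join(lines)}")
```

In a real script, transcription_result['utterances'] from the polling step would take the place of sample_utterances.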

For more information, see the Speaker Diarization model documentation or the API reference.

Specifying the number of speakers

You can provide the optional speakers_expected parameter to specify the expected number of speakers in an audio file.
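
As a rough sketch, the parameter can be added to the same request body that carries speaker_labels. The helper name build_transcript_request is hypothetical, and the check that the value is at least 1 is an assumption made for this example rather than a documented API rule.

```python
from typing import Any, Dict, Optional


def build_transcript_request(
    audio_url: str,
    speakers_expected: Optional[int] = None,
) -> Dict[str, Any]:
    """Build the JSON body for a transcript request with speaker labels enabled.

    build_transcript_request is an illustrative helper, not part of the API.
    """
    payload: Dict[str, Any] = {
        "audio_url": audio_url,
        "speaker_labels": True,
    }
    if speakers_expected is not None:
        # Assumption for this sketch: a speaker count below 1 is never meaningful
        if speakers_expected < 1:
            raise ValueError("speakers_expected must be at least 1")
        payload["speakers_expected"] = speakers_expected
    return payload


# Example: request speaker labels for a two-person conversation
data = build_transcript_request("https://assembly.ai/wildfires.mp3", speakers_expected=2)
```

The resulting dictionary can be sent with requests.post(url, json=data, headers=headers), as in the transcription step above.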

Conclusion

Automatically identifying different speakers from an audio recording, also called speaker diarization, is a multi-step process. It can unlock additional value from many genres of recording, including conference call transcripts, broadcast media, podcasts, and more. You can learn more about use cases for speaker diarization and the underlying research from the AssemblyAI blog.