Identifying speakers in audio recordings

With the Speaker Diarization model enabled, the transcript includes speaker labels alongside the text, so you can tell who said what. This step-by-step guide shows how to apply the model. In short, send the speaker_labels parameter in your request, then read the results from the utterances field of the response.

Get started

Before we begin, make sure you have an AssemblyAI account and an API key. You can sign up for a free account and get your API key from your dashboard. The complete source code for this guide can be viewed here. Here is an audio example for this guide:

https://assembly.ai/wildfires.mp3

Step-by-step instructions

Install the SDK. If you are calling the API directly from Python, install the requests library instead.

Python (requests)

pip install requests

Python SDK

pip install -U assemblyai

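Import the assemblyai package and set the API key. If you are calling the API directly, define the base URL and an authorization header that carries your API key.

Python (requests)

import time  # used later to pause between polling requests

import requests

base_url = "https://api.assemblyai.com"

headers = {
    "authorization": "<YOUR_API_KEY>"
}

Python SDK

import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"

JavaScript

// fs is used later to read the local audio file
import fs from "fs/promises";

const baseUrl = "https://api.assemblyai.com";

const headers = {
  authorization: "<YOUR_API_KEY>",
};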

With the Python SDK, create a TranscriptionConfig with speaker_labels set to True.

If you are calling the API directly, first upload your local audio file to the /v2/upload endpoint. The response contains the upload_url used in the next step.

Python (requests)

# Upload the local audio file and keep the URL returned by the API
with open("./my-audio.mp3", "rb") as f:
    response = requests.post(base_url + "/v2/upload",
                             headers=headers,
                             data=f)

upload_url = response.json()["upload_url"]

Python SDK

config = aai.TranscriptionConfig(speaker_labels=True)

JavaScript

// Upload the local audio file and keep the URL returned by the API
const path = "./my-audio.mp3";
const audioData = await fs.readFile(path);
let res = await fetch(`${baseUrl}/v2/upload`, {
  method: "POST",
  headers,
  body: audioData,
});
if (!res.ok) throw new Error(`Error: ${res.status}`);
const uploadResponse = await res.json();
const uploadUrl = uploadResponse.upload_url;

With the Python SDK, create a Transcriber object and pass in the configuration.

If you are calling the API directly, use the upload_url returned by the AssemblyAI API to create a JSON payload containing the audio_url parameter and the speaker_labels parameter set to True (true in JavaScript).

Python (requests)

data = {
    "audio_url": upload_url,
    "speaker_labels": True
}

Python SDK

transcriber = aai.Transcriber(config=config)

JavaScript

const data = {
  audio_url: uploadUrl,
  speaker_labels: true,
};

With the Python SDK, call the Transcriber object’s transcribe method and pass in the audio file’s path or URL. The method returns the transcript, which the example stores in the transcript variable.

If you are calling the API directly, make a POST request to the AssemblyAI API endpoint with the payload and headers.

Python (requests)

url = base_url + "/v2/transcript"
response = requests.post(url, json=data, headers=headers)

Python SDK

FILE_URL = "https://assembly.ai/wildfires.mp3"

transcript = transcriber.transcribe(FILE_URL)

JavaScript

const url = `${baseUrl}/v2/transcript`;
// Reuse the res variable declared in the upload step
res = await fetch(url, {
  method: "POST",
  headers: { ...headers, "Content-Type": "application/json" },
  body: JSON.stringify(data),
});
if (!res.ok) throw new Error(`Error: ${res.status}`);
const response = await res.json();

With the Python SDK, you can access the speaker label results through the transcript object’s utterances attribute.

If you are calling the API directly, the response to your request contains an ID for the transcription. Use it to poll the API every few seconds and check the status of the transcript job. Once the status is completed, retrieve the transcript from the API response and use the utterances key to access the results. If the status is error, the error key describes what went wrong.

Python (requests)

transcript_id = response.json()['id']
polling_endpoint = f"{base_url}/v2/transcript/{transcript_id}"

while True:
    transcription_result = requests.get(polling_endpoint, headers=headers).json()

    if transcription_result['status'] == 'completed':
        # When the transcript is complete, extract all utterances from the response
        transcript_text = transcription_result['text']
        utterances = transcription_result['utterances']

        # For each utterance, print its speaker and what was said
        for utterance in utterances:
            speaker = utterance['speaker']
            text = utterance['text']
            print(f"Speaker {speaker}: {text}")

        break

    elif transcription_result['status'] == 'error':
        raise RuntimeError(f"Transcription failed: {transcription_result['error']}")

    else:
        # Not finished yet: wait before polling again
        time.sleep(3)

Python SDK

# Extract all utterances from the response
utterances = transcript.utterances

# For each utterance, print its speaker and what was said
for utterance in utterances:
    speaker = utterance.speaker
    text = utterance.text
    print(f"Speaker {speaker}: {text}")

JavaScript

const transcriptId = response.id;
const pollingEndpoint = `${baseUrl}/v2/transcript/${transcriptId}`;

while (true) {
  const res = await fetch(pollingEndpoint, { headers });
  if (!res.ok) throw new Error(`Error: ${res.status}`);
  const transcriptionResult = await res.json();

  if (transcriptionResult.status === "completed") {
    const utterances = transcriptionResult.utterances;

    // Iterate through each utterance and print the speaker and the text they spoke
    for (const utterance of utterances) {
      const speaker = utterance.speaker;
      const text = utterance.text;
      console.log(`Speaker ${speaker}: ${text}`);
    }

    break;
  } else if (transcriptionResult.status === "error") {
    throw new Error(`Transcription failed: ${transcriptionResult.error}`);
  } else {
    // Not finished yet: wait three seconds before polling again
    await new Promise((resolve) => setTimeout(resolve, 3000));
  }
}
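The polling loops above run until the job either completes or fails. As a minimal sketch of the same idea with an upper bound on the number of attempts, the helper below takes the status check as a function, so it can be tried without a network call. The names poll_until_done, fetch_status, interval_seconds, and max_attempts are hypothetical and not part of the AssemblyAI API, and the "queued" and "processing" values in the usage lines are placeholders for any status other than completed or error.

```python
import time
from typing import Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict],
    interval_seconds: float = 3.0,
    max_attempts: int = 100,
) -> Dict:
    """Call fetch_status until the job completes, fails, or attempts run out."""
    for _ in range(max_attempts):
        result = fetch_status()
        status = result.get("status")
        if status == "completed":
            return result
        if status == "error":
            raise RuntimeError(f"Transcription failed: {result.get('error')}")
        # Any other status is treated as "still working": wait, then retry
        time.sleep(interval_seconds)
    raise TimeoutError(f"Transcript not ready after {max_attempts} attempts")


# Usage with a stand-in for the HTTP request (no network access needed)
_fake_responses = iter([
    {"status": "queued"},
    {"status": "processing"},
    {"status": "completed", "utterances": []},
])
result = poll_until_done(lambda: next(_fake_responses), interval_seconds=0)
print(result["status"])
```

In the requests version of this guide, fetch_status could be something like lambda: requests.get(polling_endpoint, headers=headers).json(), which keeps the retry logic separate from the HTTP call.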

Understanding the response

The speaker label information is included in the utterances key of the response. Each utterance object in the list includes a speaker field, which contains a string identifier for the speaker (for example, “A” or “B”). Each utterance also has a text field containing the spoken text, start and end timestamps, a confidence score, and a words list in which every word carries its own timestamps, confidence score, and speaker.

{
  utterances: [
    {
      confidence: 0.7246133333333334,
      end: 3738,
      speaker: "A",
      start: 570,
      text: "Um hey, Erica.",
      words: [
        {
          text: "Um",
          start: 570,
          end: 1120,
          confidence: 0.42915,
          speaker: "A",
        },
        {
          text: "hey,",
          start: 2690,
          end: 3054,
          confidence: 0.98465,
          speaker: "A",
        },
        {
          text: "Erica.",
          start: 3092,
          end: 3738,
          confidence: 0.76004,
          speaker: "A",
        },
      ],
    },
    {
      confidence: 0.6015349999999999,
      end: 4430,
      speaker: "B",
      start: 3834,
      text: "One in.",
      words: [
        {
          text: "One",
          start: 3834,
          end: 4094,
          confidence: 0.25,
          speaker: "B",
        },
        {
          text: "in.",
          start: 4132,
          end: 4430,
          confidence: 0.95307,
          speaker: "B",
        },
      ],
    },
  ],
}
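To illustrate how these fields can be used together, here is a minimal sketch that totals the speaking time per speaker and lists words below a confidence threshold. The function names speaking_time_by_speaker and low_confidence_words and the 0.5 threshold are hypothetical choices for this example. The totals are expressed in the same unit as start and end; these values look like millisecond offsets, but that unit is an assumption because this guide does not state it. The sample data is trimmed from the response shown above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def speaking_time_by_speaker(utterances: List[dict]) -> Dict[str, int]:
    """Sum (end - start) of every utterance, grouped by speaker label."""
    totals: Dict[str, int] = defaultdict(int)
    for utterance in utterances:
        totals[utterance["speaker"]] += utterance["end"] - utterance["start"]
    return dict(totals)


def low_confidence_words(utterances: List[dict], threshold: float) -> List[Tuple[str, str]]:
    """Return (speaker, word) pairs whose word confidence is below threshold."""
    flagged = []
    for utterance in utterances:
        for word in utterance.get("words", []):
            if word["confidence"] < threshold:
                flagged.append((word["speaker"], word["text"]))
    return flagged


# Sample data trimmed from the response shown in this guide
sample_utterances = [
    {
        "speaker": "A", "start": 570, "end": 3738, "text": "Um hey, Erica.",
        "words": [
            {"text": "Um", "confidence": 0.42915, "speaker": "A"},
            {"text": "hey,", "confidence": 0.98465, "speaker": "A"},
            {"text": "Erica.", "confidence": 0.76004, "speaker": "A"},
        ],
    },
    {
        "speaker": "B", "start": 3834, "end": 4430, "text": "One in.",
        "words": [
            {"text": "One", "confidence": 0.25, "speaker": "B"},
            {"text": "in.", "confidence": 0.95307, "speaker": "B"},
        ],
    },
]

totals = speaking_time_by_speaker(sample_utterances)
flagged = low_confidence_words(sample_utterances, threshold=0.5)
print(totals)
print(flagged)
```

With the requests version, you would pass transcription_result['utterances'] instead of the sample list.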

For more information, see the Speaker Diarization model documentation or the API reference.

Specifying the number of speakers

If you know how many speakers to expect in an audio file, you can pass that number in the optional speakers_expected parameter. For details, see the API/Model Reference.
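As a minimal sketch, the helper below builds a request payload that includes speakers_expected next to speaker_labels. The function name build_transcript_request and the value 2 are hypothetical, and placing speakers_expected in the same JSON body as speaker_labels is an assumption based on how this guide sends speaker_labels; check the API reference for the accepted values.

```python
from typing import Optional


def build_transcript_request(audio_url: str, speakers_expected: Optional[int] = None) -> dict:
    """Build the JSON payload for a transcript request with speaker labels enabled."""
    payload = {"audio_url": audio_url, "speaker_labels": True}
    if speakers_expected is not None:
        if speakers_expected < 1:
            raise ValueError("speakers_expected must be a positive integer")
        # Optional hint: only sent when the caller provides a value
        payload["speakers_expected"] = speakers_expected
    return payload


# Usage: the audio URL is the sample file from this guide; 2 is an example value
payload = build_transcript_request("https://assembly.ai/wildfires.mp3", speakers_expected=2)
print(payload)
```

The resulting dictionary can be sent as the json argument of the POST request to /v2/transcript shown earlier in this guide.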

Conclusion

Automatically identifying different speakers from an audio recording, also called speaker diarization, is a multi-step process. It can unlock additional value from many genres of recording, including conference call transcripts, broadcast media, podcasts, and more. You can learn more about use cases for speaker diarization and the underlying research from the AssemblyAI blog.
