Speaker Identification

Supported languages

Supported models

Supported regions

US & EU

Overview

Replace generic “Speaker A” and “Speaker B” labels with real names or roles, no voice enrollment needed. Speaker Identification uses conversation content to infer who’s speaking and applies the identifiers you provide. It can infer roles from context clues, and can associate names when the names are present within the transcript. Example transformation: Before:

Speaker A: Good morning, and welcome to the show.
Speaker B: Thanks for having me.
Speaker A: Let's dive into today's topic...

After (by name):

Michel Martin: Good morning, and welcome to the show.
Peter DeCarlo: Thanks for having me.
Michel Martin: Let's dive into today's topic...

After (by role):

Interviewer: Good morning, and welcome to the show.
Interviewee: Thanks for having me.
Interviewer: Let's dive into today's topic...

Speaker Identification requires Speaker Diarization. You must set speaker_labels: true in your transcription request.

To reliably identify speakers, your audio should contain clear, distinguishable voices and sufficient spoken audio from each speaker. The accuracy of Speaker Diarization depends on the quality of the audio and the distinctiveness of each speaker’s voice, which will have a downstream effect on the quality of Speaker Identification.

Choosing how to identify speakers

You can identify speakers by name or by role:

Know the speakers’ names? Use speaker_type: "name" with the names in speakers. Click here to learn more.
Know their roles but not names? Use speaker_type: "role" with roles like "Interviewer" or "Agent" in speakers. Click here to learn more.
Need better accuracy? Use speakers with description fields that provide context about what each speaker typically discusses. Click here to learn more.

How to use Speaker Identification

Include the speech_understanding parameter in your transcription request to identify speakers as part of transcription.

Identify by name

To identify speakers by name, use speaker_type: "name" with a list of speaker names in speakers. This is the most common approach when you know who is speaking in the audio.

Python
JavaScript
Python SDK
JavaScript SDK

import requests
import time

base_url = "https://api.assemblyai.com"

headers = {
    "authorization": "<YOUR_API_KEY>"
}

# Need to transcribe a local file? Learn more here: https://www.assemblyai.com/docs/getting-started/transcribe-an-audio-file

upload_url = "https://assembly.ai/wildfires.mp3"

# Configure transcript with speaker identification

data = {
    "audio_url": upload_url,
    "language_detection": True,
    "speaker_labels": True,
    "speech_understanding": {
        "request": {
            "speaker_identification": {
                "speaker_type": "name",
                "speakers": [{"name": "Michel Martin"}, {"name": "Peter DeCarlo"}]  # Change these values to match the names of the speakers in your file
            }
        }
    }
}

# Submit the transcription request

response = requests.post(base_url + "/v2/transcript", headers=headers, json=data)
transcript_id = response.json()["id"]
polling_endpoint = base_url + f"/v2/transcript/{transcript_id}"

# Poll for transcription results

while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()

    if transcript["status"] == "completed":
        break

    elif transcript["status"] == "error":
        raise RuntimeError(f"Transcription failed: {transcript['error']}")

    else:
        time.sleep(3)

# Access the results and print utterances to the terminal

for utterance in transcript["utterances"]:
    print(f"{utterance['speaker']}: {utterance['text']}")

const baseUrl = "https://api.assemblyai.com";

const headers = {
  "authorization": "<YOUR_API_KEY>",
  "content-type": "application/json"
};

// Need to transcribe a local file? Learn more here: https://www.assemblyai.com/docs/getting-started/transcribe-an-audio-file
const uploadUrl = "https://assembly.ai/wildfires.mp3";

// Configure transcript with speaker identification
const data = {
  audio_url: uploadUrl,
  language_detection: true,
  speaker_labels: true,
  speech_understanding: {
    request: {
      speaker_identification: {
        speaker_type: "name",
        speakers: [{"name": "Michel Martin"}, {"name": "Peter DeCarlo"}]  // Change these values to match the names of the speakers in your file
      }
    }
  }
};

async function main() {
  // Submit the transcription request
  const response = await fetch(`${baseUrl}/v2/transcript`, {
    method: "POST",
    headers: headers,
    body: JSON.stringify(data)
  });

  const { id: transcriptId } = await response.json();
  const pollingEndpoint = `${baseUrl}/v2/transcript/${transcriptId}`;

  // Poll for transcription results
  while (true) {
    const pollingResponse = await fetch(pollingEndpoint, { headers });
    const transcript = await pollingResponse.json();

    if (transcript.status === "completed") {
      // Access the results and print utterances to the console
      for (const utterance of transcript.utterances) {
        console.log(`${utterance.speaker}: ${utterance.text}`);
      }
      break;
    } else if (transcript.status === "error") {
      throw new Error(`Transcription failed: ${transcript.error}`);
    } else {
      await new Promise(resolve => setTimeout(resolve, 3000));
    }
  }
}

main().catch(console.error);

import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"

# Need to transcribe a local file? Learn more here: https://www.assemblyai.com/docs/getting-started/transcribe-an-audio-file
audio_url = "https://assembly.ai/wildfires.mp3"

# Configure transcript with speaker identification
config = aai.TranscriptionConfig(
    language_detection=True,
    speaker_labels=True,
    speech_understanding=aai.SpeechUnderstandingRequest(
        request=aai.SpeechUnderstandingFeatureRequests(
            speaker_identification=aai.SpeakerIdentificationRequest(
                speaker_type="name",
                speakers=[{"name": "Michel Martin"}, {"name": "Peter DeCarlo"}]  # Change these values to match the names of the speakers in your file
            )
        )
    )
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(audio_url, config)

# Access the results and print utterances to the terminal
for utterance in transcript.utterances:
    print(f"{utterance.speaker}: {utterance.text}")

import { AssemblyAI } from "assemblyai";

const client = new AssemblyAI({
  apiKey: "<YOUR_API_KEY>"
});

// Need to transcribe a local file? Learn more here: https://www.assemblyai.com/docs/getting-started/transcribe-an-audio-file
const audioUrl = "https://assembly.ai/wildfires.mp3";

// Configure transcript with speaker identification
const params = {
  audio: audioUrl,
  language_detection: true,
  speaker_labels: true,
  speech_understanding: {
    request: {
      speaker_identification: {
        speaker_type: "name",
        speakers: [{"name": "Michel Martin"}, {"name": "Peter DeCarlo"}]  // Change these values to match the names of the speakers in your file
      }
    }
  }
};

const transcript = await client.transcripts.transcribe(params);

// Access the results and print utterances to the console
for (const utterance of transcript.utterances) {
  console.log(`${utterance.speaker}: ${utterance.text}`);
}

Identify by role

To identify speakers by role instead of name, use speaker_type: "role" with role labels in speakers. This is useful for customer service calls, interviews, or any scenario where you know the roles but not the names.

Python
JavaScript
Python SDK
JavaScript SDK

import requests
import time

base_url = "https://api.assemblyai.com"

headers = {
    "authorization": "<YOUR_API_KEY>"
}

# Need to transcribe a local file? Learn more here: https://www.assemblyai.com/docs/getting-started/transcribe-an-audio-file

upload_url = "https://assembly.ai/wildfires.mp3"

# Configure transcript with role-based speaker identification

data = {
    "audio_url": upload_url,
    "language_detection": True,
    "speaker_labels": True,
    "speech_understanding": {
        "request": {
            "speaker_identification": {
                "speaker_type": "role",
                "speakers": [{"role": "Interviewer"}, {"role": "Interviewee"}]  # Change these values to match the roles of the speakers in your file
            }
        }
    }
}

# Submit the transcription request

response = requests.post(base_url + "/v2/transcript", headers=headers, json=data)
transcript_id = response.json()["id"]
polling_endpoint = base_url + f"/v2/transcript/{transcript_id}"

# Poll for transcription results

while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()

    if transcript["status"] == "completed":
        break

    elif transcript["status"] == "error":
        raise RuntimeError(f"Transcription failed: {transcript['error']}")

    else:
        time.sleep(3)

# Access the results and print utterances to the terminal

for utterance in transcript["utterances"]:
    print(f"{utterance['speaker']}: {utterance['text']}")

const baseUrl = "https://api.assemblyai.com";

const headers = {
  "authorization": "<YOUR_API_KEY>",
  "content-type": "application/json"
};

// Need to transcribe a local file? Learn more here: https://www.assemblyai.com/docs/getting-started/transcribe-an-audio-file
const uploadUrl = "https://assembly.ai/wildfires.mp3";

// Configure transcript with role-based speaker identification
const data = {
  audio_url: uploadUrl,
  language_detection: true,
  speaker_labels: true,
  speech_understanding: {
    request: {
      speaker_identification: {
        speaker_type: "role",
        speakers: [{"role": "Interviewer"}, {"role": "Interviewee"}]  // Change these values to match the roles of the speakers in your file
      }
    }
  }
};

async function main() {
  // Submit the transcription request
  const response = await fetch(`${baseUrl}/v2/transcript`, {
    method: "POST",
    headers: headers,
    body: JSON.stringify(data)
  });

  const { id: transcriptId } = await response.json();
  const pollingEndpoint = `${baseUrl}/v2/transcript/${transcriptId}`;

  // Poll for transcription results
  while (true) {
    const pollingResponse = await fetch(pollingEndpoint, { headers });
    const transcript = await pollingResponse.json();

    if (transcript.status === "completed") {
      // Access the results and print utterances to the console
      for (const utterance of transcript.utterances) {
        console.log(`${utterance.speaker}: ${utterance.text}`);
      }
      break;
    } else if (transcript.status === "error") {
      throw new Error(`Transcription failed: ${transcript.error}`);
    } else {
      await new Promise(resolve => setTimeout(resolve, 3000));
    }
  }
}

main().catch(console.error);

import assemblyai as aai

aai.settings.api_key = "<YOUR_API_KEY>"

audio_url = "https://assembly.ai/wildfires.mp3"

config = aai.TranscriptionConfig(
    language_detection=True,
    speaker_labels=True,
    speech_understanding=aai.SpeechUnderstandingRequest(
        request=aai.SpeechUnderstandingFeatureRequests(
            speaker_identification=aai.SpeakerIdentificationRequest(
                speaker_type="role",
                speakers=[{"role": "Interviewer"}, {"role": "Interviewee"}]  # Change these values to match the roles of the speakers in your file
            )
        )
    )
)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe(audio_url, config)

for utterance in transcript.utterances:
    print(f"{utterance.speaker}: {utterance.text}")

import { AssemblyAI } from "assemblyai";

const client = new AssemblyAI({
  apiKey: "<YOUR_API_KEY>"
});

const audioUrl = "https://assembly.ai/wildfires.mp3";

const params = {
  audio: audioUrl,
  language_detection: true,
  speaker_labels: true,
  speech_understanding: {
    request: {
      speaker_identification: {
        speaker_type: "role",
        speakers: [{"role": "Interviewer"}, {"role": "Interviewee"}]  // Change these values to match the roles of the speakers in your file
      }
    }
  }
};

const transcript = await client.transcripts.transcribe(params);

for (const utterance of transcript.utterances) {
  console.log(`${utterance.speaker}: ${utterance.text}`);
}

Common role combinations

[{"role": "Agent"}, {"role": "Customer"}] - Customer service calls
[{"role": "AI Assistant"}, {"role": "User"}] - AI chatbot interactions
[{"role": "Support"}, {"role": "Customer"}] - Technical support calls
[{"role": "Interviewer"}, {"role": "Interviewee"}] - Interview recordings
[{"role": "Host"}, {"role": "Guest"}] - Podcast or show recordings
[{"role": "Moderator"}, {"role": "Panelist"}] - Panel discussions

Controlling Effort

Sometimes when you are using the speaker identification the results are not quite as good high quality as you want. To assist with this there is a configuration to increase the quality of the results at a higher cost. Speaker Identification has two effort modes currently: low and medium. The default is low.

When to use low/default effort

The low/default effort works best for lower complexity situations. Some examples:

Transcripts less than 10m long
Transcripts where there are clearly segmented speakers
- doctor/patient meeting
- podcasts
- customer support calls
- note: audio size is less restrictive here
Names and/or roles are well demonstrated in the transcript
- names clearly spoken
- immediately clear who has which role
  - example: “Hi thank you for calling insert_company_name, how can I help you” clearly indicates a customer support role

When to use medium effort

The medium effort works best for higher complexity transcripts, where information is less clear and/or speakers are less defined. Some example:

Meeting call in a conferance room
Arguments and/or elevated conversations where individuals interrupt each other
Transcripts where one or more names are not clearly stated
- note: names cannot be synthesized from nothing, we cannot extract names if none are present in the transcript
Transcripts where the roles are nuanced or not clear

Apply to an existing transcript

If you already have a completed transcript, you can add Speaker Identification in a separate request to the Speech Understanding API. This is useful when you want to re-identify speakers with different parameters, or when your workflow separates transcription from post-processing. First, transcribe your audio with speaker_labels: true. Once transcription is complete, send the transcript_id along with your speaker identification configuration to the Speech Understanding API.

Python
JavaScript

import requests
import time

base_url = "https://api.assemblyai.com"

headers = {
    "authorization": "<YOUR_API_KEY>"
}

# Need to transcribe a local file? Learn more here: https://www.assemblyai.com/docs/getting-started/transcribe-an-audio-file

upload_url = "https://assembly.ai/wildfires.mp3"

data = {
    "audio_url": upload_url,
    "language_detection": True,
    "speaker_labels": True
}

# Transcribe file

response = requests.post(base_url + "/v2/transcript", headers=headers, json=data)

transcript_id = response.json()["id"]
polling_endpoint = base_url + f"/v2/transcript/{transcript_id}"

# Poll for transcription results

while True:
    transcript = requests.get(polling_endpoint, headers=headers).json()

    if transcript["status"] == "completed":
        break

    elif transcript["status"] == "error":
        raise RuntimeError(f"Transcription failed: {transcript['error']}")

    else:
        time.sleep(3)

# Enable speaker identification

understanding_body = {
    "transcript_id": transcript_id,
    "speech_understanding": {
        "request": {
            "speaker_identification": {
                "speaker_type": "name",
                "speakers": [{"name": "Michel Martin"}, {"name": "Peter DeCarlo"}]  # Change these values to match the names of the speakers in your file
            }
        }
    }
}

# Send the modified transcript to the Speech Understanding API

result = requests.post(
    "https://llm-gateway.assemblyai.com/v1/understanding",
    headers=headers,
    json=understanding_body
).json()

# Access the results and print utterances to the terminal

for utterance in result["utterances"]:
    print(f"{utterance['speaker']}: {utterance['text']}")

const baseUrl = "https://api.assemblyai.com";
const apiKey = "<YOUR_API_KEY>";

const headers = {
  "authorization": apiKey,
  "content-type": "application/json"
};

// Need to transcribe a local file? Learn more here: https://www.assemblyai.com/docs/getting-started/transcribe-an-audio-file
const uploadUrl = "https://assembly.ai/wildfires.mp3";

async function transcribeAndIdentifySpeakers() {
  // Transcribe file
  const transcriptResponse = await fetch(`${baseUrl}/v2/transcript`, {
    method: 'POST',
    headers: headers,
    body: JSON.stringify({
      audio_url: uploadUrl,
      language_detection: true,
      speaker_labels: true
    })
  });

  const { id: transcriptId } = await transcriptResponse.json();
  const pollingEndpoint = `${baseUrl}/v2/transcript/${transcriptId}`;

  // Poll for transcription results
  while (true) {
    const pollingResponse = await fetch(pollingEndpoint, { headers });
    const transcript = await pollingResponse.json();

    if (transcript.status === "completed") {
      break;
    } else if (transcript.status === "error") {
      throw new Error(`Transcription failed: ${transcript.error}`);
    } else {
      await new Promise(resolve => setTimeout(resolve, 3000));
    }
  }

  // Enable speaker identification
  const understandingBody = {
    transcript_id: transcriptId,
    speech_understanding: {
      request: {
        speaker_identification: {
          speaker_type: "name",
          speakers: [{"name": "Michel Martin"}, {"name": "Peter DeCarlo"}]  // Change these values to match the names of the speakers in your file
        }
      }
    }
  };

  // Send the modified transcript to the Speech Understanding API
  const understandingResponse = await fetch(
    "https://llm-gateway.assemblyai.com/v1/understanding",
    {
      method: 'POST',
      headers: headers,
      body: JSON.stringify(understandingBody)
    }
  );

  const result = await understandingResponse.json();

  // Access the results and print utterances to the terminal
  for (const utterance of result.utterances) {
    console.log(`${utterance.speaker}: ${utterance.text}`);
  }
}

transcribeAndIdentifySpeakers();

The example above identifies speakers by name. To identify by role, keep the same two-step flow and set speaker_type: "role" with role labels in speakers (see Identify by role). The speakers metadata approach works with this flow too.

Adding speaker metadata

The speakers parameter lets you provide additional metadata about each speaker to help the model identify speakers based on conversational context. This is particularly useful when:

Speakers have similar voices but distinct roles or topics
You want to provide contextual clues about what each speaker typically discusses
You need more precise identification in complex multi-speaker scenarios

Each speaker object must include either a name or role (depending on speaker_type). Beyond that, you can add any additional properties you want. The name and role fields are reserved as strings, but all other properties are flexible and can be any structure.

Examples in this section are shown in Python for brevity. The same speaker_identification configuration works in any language.

At its simplest, you can provide a description alongside each speaker’s name or role:

data = {
  "audio_url": upload_url,
  "speaker_labels": True,
  "speech_understanding": {
    "request": {
      "speaker_identification": {
        "speaker_type": "role",
        "speakers": [
          {
            "role": "interviewer",
            "description": "Hosts the program and interviews the guests"
          },
          {
            "role": "guest",
            "description": "Answers questions from the interview"
          }
        ]
      }
    }
  }
}

For even more fine-tuned identification, you can include any additional custom properties on each speaker object, such as company, title, department, or any other fields that help describe the speaker:

data = {
  "audio_url": upload_url,
  "speaker_labels": True,
  "speech_understanding": {
    "request": {
      "speaker_identification": {
        "speaker_type": "name",
        "speakers": [
          {
            "name": "Michel Martin",
            "description": "Hosts the program and interviews the guests",
            "company": "NPR",
            "title": "Host Morning Edition"
          },
          {
            "name": "Peter DeCarlo",
            "description": "Answers questions from the interview",
            "company": "Johns Hopkins University",
            "title": "Professor and Vice Chair of Environmental Health and Engineering"
          }
        ]
      }
    }
  }
}

You can use the same custom properties with role-based identification by replacing name with role in each speaker object.

API reference

Request

In a transcription request

Include the speech_understanding parameter directly in your transcription request (shown here with name-based identification):

curl -X POST \
  "https://api.assemblyai.com/v2/transcript" \
  -H "Authorization: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://assembly.ai/wildfires.mp3",
    "speaker_labels": true,
    "speech_understanding": {
      "request": {
        "speaker_identification": {
          "speaker_type": "name",
          "speakers": [{"name": "Michel Martin"}, {"name": "Peter DeCarlo"}]
        }
      }
    }
  }'

On an existing transcript

Transcribe with speaker_labels: true, then send the completed transcript_id to the Speech Understanding API:

# Step 1: Submit transcription job
curl -X POST "https://api.assemblyai.com/v2/transcript" \
  -H "authorization: <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "audio_url": "https://assembly.ai/wildfires.mp3",
    "speaker_labels": true
  }'

# Save the transcript_id from the response above, then use it in the following commands

# Step 2: Poll for transcription status (repeat until status is "completed")
curl -X GET "https://api.assemblyai.com/v2/transcript/{transcript_id}" \
  -H "authorization: <YOUR_API_KEY>"

# Step 3: Once transcription is completed, enable speaker identification
curl -X POST "https://llm-gateway.assemblyai.com/v1/understanding" \
  -H "authorization: <YOUR_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "transcript_id": "{transcript_id}",
    "speech_understanding": {
      "request": {
        "speaker_identification": {
          "speaker_type": "name",
          "speakers": [{"name": "Michel Martin"}, {"name": "Peter DeCarlo"}]
        }
      }
    }
  }'

Request parameters

The following parameters are nested under speech_understanding.request.speaker_identification:

Key	Type	Required?	Description
`speaker_type`	string	Yes	The type of speakers being identified, values accepted are “name” for actual names or “role” for roles/titles.
`speakers`	array	Conditional	An array of speaker objects with metadata. You can include any additional custom properties beyond `name`/`role` and `description`.
`speakers[].role`	string	Conditional	The role of the speaker. Required when `speaker_type` is “role”.
`speakers[].name`	string	Conditional	The name of the speaker. Required when `speaker_type` is “name”.
`speakers[].description`	string	No	A description of the speaker to help the model identify them based on conversational context.
`speakers[].<custom>`	any	No	Any additional custom properties (e.g., `company`, `title`, `department`) to provide more context about the speaker. The `name` and `role` fields are reserved as strings, but all other properties are flexible.

Response

The Speaker Identification API returns a modified version of your transcript with updated speaker labels in the utterances key.

{
  "speech_understanding": {
    "request": {
      "speaker_identification": {
        "speaker_type": "name",
        "speakers": [{"name": "Michel Martin"}, {"name": "Peter DeCarlo"}]
      }
    },
    "response": {
      "speaker_identification": {
        "mapping": {
          "A": "Michel Martin",
          "B": "Peter DeCarlo"
        },
        "status": "success"
      }
    }
  },
  "utterances": [
    {
      "speaker": "Michel Martin",
      "text": "Smoke from hundreds of wildfires in Canada is triggering air quality alerts...",
      "start": 240,
      "end": 26560,
      "confidence": 0.9815734,
      "words": [
        {
          "text": "Smoke",
          "start": 240,
          "end": 640,
          "confidence": 0.90152997,
          "speaker": "Michel Martin"
        }
        // ... more words
      ]
    }
    // ... more utterances
  ]
}

Response fields

Key	Type	Description
`speech_understanding.response.speaker_identification.mapping`	object	A mapping of the original generic speaker labels (e.g., “A”, “B”) to the identified speaker names or roles.
`speech_understanding.response.speaker_identification.status`	string	The status of the speaker identification request (e.g., “success”).
`utterances`	array	A turn-by-turn temporal sequence of the transcript, where the i-th element is an object containing information about the i-th utterance in the audio file.
`utterances[i].confidence`	number	The confidence score for the transcript of this utterance.
`utterances[i].end`	number	The ending time, in milliseconds, of the utterance in the audio file.
`utterances[i].speaker`	string	The identified speaker name or role for this utterance.
`utterances[i].start`	number	The starting time, in milliseconds, of the utterance in the audio file.
`utterances[i].text`	string	The transcript for this utterance.
`utterances[i].words`	array	A sequential array for the words in the transcript, where the j-th element is an object containing information about the j-th word in the utterance.
`utterances[i].words[j].text`	string	The text of the j-th word in the i-th utterance.
`utterances[i].words[j].start`	number	The starting time for when the j-th word is spoken in the i-th utterance, in milliseconds.
`utterances[i].words[j].end`	number	The ending time for when the j-th word is spoken in the i-th utterance, in milliseconds.
`utterances[i].words[j].confidence`	number	The confidence score for the transcript of the j-th word in the i-th utterance.
`utterances[i].words[j].speaker`	string	The identified speaker name or role who uttered the j-th word in the i-th utterance.

With Speaker Identification, the speaker field in utterances and words contains the identified name or role (e.g., "Michel Martin" or "Agent") instead of generic labels like "A", "B", "C". All other fields (text, start, end, confidence, words) remain unchanged from the standard transcription response.

​Overview

​Choosing how to identify speakers

​How to use Speaker Identification

​Identify by name

​Identify by role

​Common role combinations

​Controlling Effort

​When to use low/default effort

​When to use medium effort

​Apply to an existing transcript

​Adding speaker metadata

​API reference

​Request

​In a transcription request

​On an existing transcript

​Request parameters

​Response

​Response fields

Overview

Choosing how to identify speakers

How to use Speaker Identification

Identify by name

Identify by role

Common role combinations

Controlling Effort

When to use low/default effort

When to use medium effort

Apply to an existing transcript

Adding speaker metadata

API reference

Request

In a transcription request

On an existing transcript

Request parameters

Response

Response fields