Top 8 speaker diarization libraries and APIs in 2026
In this blog post, we'll look at how speaker diarization works, why it's useful, some of its current limitations, and the top eight speaker diarization libraries and APIs for product teams and developers to use.



In its simplest form, speaker diarization answers the question: who spoke when?
In the field of Automatic Speech Recognition (ASR), speaker diarization refers to (A) automatically detecting the number of speakers in an audio file and (B) assigning each transcribed word to the correct speaker.
Today, many modern speech-to-text APIs and speaker diarization libraries apply advanced AI models to perform tasks (A) and (B) with near human-level accuracy, significantly increasing the utility of speaker diarization APIs. Recent advances in 2025 have dramatically improved performance in challenging real-world conditions; for example, AssemblyAI's new speaker embedding model delivers a documented 30% improvement in noisy environments.
Below, we'll examine the different approaches available and help you choose the right solution for your specific needs.
What is speaker diarization?
Speaker diarization answers the question: "Who spoke when?" It involves segmenting and labeling an audio stream by speaker, allowing for a clearer understanding of who is speaking at any given time. This process is essential for automatic speech recognition (ASR), meeting transcription, and call center analytics, transforming raw audio into structured, actionable insights.
Speaker diarization performs two key functions:
- Speaker Detection: Identifying the number of distinct speakers in an audio file.
- Speaker Attribution: Assigning segments of speech to the correct speaker.
The result is a transcript where each segment of speech is tagged with a speaker label (e.g., "Speaker A," "Speaker B"), making it easy to distinguish between different voices. This improves the readability of transcripts and increases the accuracy of analyses that depend on understanding who said what.
How does speaker diarization work?
The fundamental task of speaker diarization is to apply speaker labels (i.e., "Speaker A," "Speaker B," etc.) to each utterance in the transcription text of an audio/video file.
Accurate speaker diarization requires several steps. The first step is to break the audio file into a set of "utterances." What constitutes an utterance? Generally, an utterance is between half a second and 10 seconds of speech. To illustrate this, let's look at the examples below:
Utterance 1:
Hello my name is Bob.
Utterance 2:
I like cats and live in New York City.
AI models need a sufficient amount of audio from a speaker to identify them accurately, much as humans do. Audio segmentation uses silence detection and punctuation markers to create utterances between 0.5 and 10 seconds long (a minimal segmentation sketch follows the list below).
Our research shows diarization accuracy drops significantly when utterances are shorter than one second:
- Optimal range: 1-10 seconds per utterance
- Minimum threshold: 0.5 seconds for basic detection
- Accuracy degradation: utterances below 1 second show measurable performance loss
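As a rough illustration of the segmentation step, the sketch below splits an audio file at silences using the open-source pydub library and keeps only chunks in the 0.5-10 second range. The thresholds and file name are illustrative assumptions, not values used by any particular diarization system.

```python
# Illustrative utterance segmentation via silence detection (assumed thresholds).
# Requires: pip install pydub (plus ffmpeg for non-WAV formats).
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("conversation.wav")

chunks = split_on_silence(
    audio,
    min_silence_len=500,             # a pause of >= 500 ms ends an utterance
    silence_thresh=audio.dBFS - 16,  # "silence" = 16 dB below average loudness
    keep_silence=200,                # keep 200 ms of padding around each chunk
)

# Keep only chunks that fall in the 0.5-10 second range discussed above.
utterances = [c for c in chunks if 500 <= len(c) <= 10_000]  # len() is in ms
print(f"{len(utterances)} utterances kept out of {len(chunks)} chunks")
```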
Once an audio file is broken into utterances, those utterances get sent through a deep learning model that has been trained to produce "embeddings" that are highly representative of a speaker's characteristics. An embedding is a deep learning model's low-dimensional representation of an input. For example, the image below shows what the embedding of a word looks like:

We perform a similar process to convert segments of audio, rather than words, into embeddings.
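To make the embedding step concrete, here is a minimal sketch that turns a single utterance into a speaker embedding using SpeechBrain's pretrained ECAPA-TDNN encoder (SpeechBrain is covered later in this list). This is just one possible embedding model; commercial APIs use their own proprietary models, and the file name here is a placeholder.

```python
# Sketch: convert an audio segment into a fixed-size speaker embedding vector.
# Requires: pip install speechbrain torchaudio
# (newer SpeechBrain versions expose this class under speechbrain.inference)
import torchaudio
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",  # pretrained speaker encoder
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# Load a single utterance (mono, 16 kHz audio is expected by this model).
signal, sample_rate = torchaudio.load("utterance_01.wav")

# encode_batch returns one embedding per input utterance.
embedding = encoder.encode_batch(signal)
print(embedding.shape)  # e.g. torch.Size([1, 1, 192]) for ECAPA-TDNN
```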
Next, we need to determine how many speakers are present in the audio file. This is a key capability of a modern speaker diarization model: legacy systems required knowing the number of speakers in an audio/video file ahead of time, whereas modern models can accurately predict it.
The first goal here is to overestimate the number of speakers. Using a clustering method, we determine the greatest number of speakers that could reasonably be heard in the audio. Why overestimate? It's much easier to combine the utterances of one speaker who has been incorrectly split into two than it is to disentangle the utterances of two speakers who have incorrectly been merged into one.
After this initial step, we merge or split speakers as needed to arrive at an accurate count.
Finally, speaker diarization models take the utterance embeddings (produced above), and cluster them into as many clusters as there are speakers. For example, if a speaker diarization model predicts there are four speakers in an audio file, the embeddings will be forced into four groups based on the "similarity" of the embeddings.
In the image below, assume each dot is an utterance. The utterances get clustered together based on their similarity, with the idea being that each cluster corresponds to the utterances of a unique speaker.

There are many ways to determine similarity of embeddings, and this is a core component of accurately predicting speaker labels with a speaker diarization model. After this step, you now have a transcription complete with accurate speaker labels!
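The following sketch shows one simple way to implement the "overestimate, then settle on a count" idea: try several candidate speaker counts, cluster the utterance embeddings for each, and keep the count whose clusters are most cohesive by silhouette score. Production systems use more sophisticated methods (e.g., spectral clustering with eigengap analysis), so treat this purely as an illustration.

```python
# Sketch: pick a speaker count and cluster utterance embeddings.
# Requires: pip install scikit-learn numpy (scikit-learn >= 1.2 for the `metric` argument)
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def cluster_speakers(embeddings: np.ndarray, max_speakers: int = 10):
    """embeddings: array of shape (num_utterances, embedding_dim)."""
    best_labels, best_score = None, -1.0
    # Try candidate counts from 2 up to an overestimate (max_speakers).
    for k in range(2, min(max_speakers, len(embeddings) - 1) + 1):
        labels = AgglomerativeClustering(
            n_clusters=k, metric="cosine", linkage="average"
        ).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels, metric="cosine")
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels

# Toy example: 6 utterance embeddings that should form 2 speakers.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (3, 192)) + 1, rng.normal(0, 0.1, (3, 192)) - 1])
print(cluster_speakers(emb))  # e.g. [0 0 0 1 1 1] -> Speaker A x3, Speaker B x3
```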
Recent research confirms that end-to-end approaches which eliminate traditional pipeline stages are on the rise, treating diarization as a unified problem. These newer architectures can better handle overlapping speech and brief utterances that previously challenged traditional systems.
Today's speaker diarization models can be used to determine multiple speakers in the same audio/video file with high accuracy. It's important to note that speaker diarization differs from speaker recognition — diarization identifies different speakers without knowing their identities, while recognition matches voices to known individuals.
Speaker diarization example process
To illustrate this process, let's consider an example:
- Audio Segmentation: An audio file containing a conversation is segmented into utterances based on pauses and punctuation.
- Feature Extraction: Each utterance is processed by an AI model to create embeddings that represent the unique vocal characteristics of each speaker.
- Clustering: The embeddings are clustered into groups based on their proximity in the embedding space. Each cluster is expected to correspond to one person.
- Speaker Attribution: The utterances within each cluster are labeled with the same speaker tags, and these tags are used to annotate the transcript.
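In practice, you rarely build these steps yourself. As a hedged sketch, here is how the whole process looks with the open-source pyannote.audio toolkit (covered later in this list), which bundles segmentation, embedding, and clustering into a single pretrained pipeline. It requires a Hugging Face access token and acceptance of the gated model's terms; the token and file name below are placeholders.

```python
# Sketch: run all four steps with a pretrained pyannote.audio pipeline.
# Requires: pip install pyannote.audio, plus a Hugging Face token with access
# to the gated pyannote/speaker-diarization-3.1 model.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder
)

diarization = pipeline("conversation.wav")

# Each track is a time span attributed to one speaker label.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")
```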
Why is speaker diarization useful?
Speaker diarization transforms unstructured transcripts into labeled conversations. Without speaker labels, readers must manually assign utterances to speakers, creating cognitive overhead and processing delays.
For example, let's look at the before and after transcripts below with and without speaker diarization:
Without speaker diarization: [Example of wall of text]
With speaker diarization: [Example with speaker labels]
See how much easier the transcription is to read with speaker diarization?
Speaker diarization is also a powerful analytics tool. By identifying and labeling speakers, product teams and developers can analyze each speaker's behaviors, identify patterns/trends among individual speakers, make predictions, and more. For example:
- A call center might analyze agent and customer calls, requests, and complaints to identify trends that could help facilitate better communication.
- A podcast service might use speaker labels to identify the host and guest, making transcriptions more readable for end users.
- A telemedicine platform might identify doctor and patient to create an accurate transcript, attach a readable transcript to patient files, or input the transcript into an EHR system.
Enterprises are already leveraging these capabilities to transform their operations and create powerful transcription and analysis tools.
Additional practical applications include:
- Legal compliance: Accurate identification of speakers in legal proceedings is critical for maintaining reliable records, since speaker attribution carries legal weight and helps ensure all parties are properly represented.
- Meeting records: In business settings, speaker labels help track contributions and action items, turning transcripts into actionable documents that support accountability.
- Educational use: In educational settings, speaker-labeled transcripts can help students follow along with lectures more easily and review material more effectively.
- Research and analysis: For researchers, being able to distinguish between different speakers can provide deeper insights into conversational dynamics and interaction patterns.
Top 8 speaker diarization libraries and APIs
Choosing the right speaker diarization solution depends on your specific needs: accuracy requirements, processing speed, integration complexity, and whether you need a managed API or open-source flexibility. Let's examine each of the leading solutions in detail:
1. AssemblyAI
AssemblyAI is a leading speech recognition startup that offers highly accurate speech-to-text transcription through its latest models, including Universal-3-Pro. In addition to transcription, AssemblyAI provides a suite of Speech Understanding models for tasks like sentiment analysis, topic detection, summarization, and entity detection.
AssemblyAI's speaker diarization has seen dramatic improvements, achieving a 10.1% improvement in Diarization Error Rate (DER) and 13.2% improvement in cpWER. The latest models demonstrate 30% better performance in noisy environments and handle speaker segments as short as 250ms with 43% improved accuracy compared to previous versions. For even greater precision, AssemblyAI also offers Speaker Identification, which can replace generic labels like "Speaker A" with actual names or roles (e.g., "John Smith" or "Customer") inferred from the conversation.
Key improvements:
- Industry-leading 2.9% speaker count error rate (based on internal benchmarking)
- Enhanced handling of similar voices and short utterances
- Broad language support: 6 languages with Universal-3-Pro and 99 languages with Universal-2
- Improved timestamp accuracy through the Universal-3-Pro model
- Pricing: Tiered pricing starting at $0.15/hr for Universal-2 and $0.21/hr for Universal-3-Pro
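For reference, enabling speaker diarization with AssemblyAI's Python SDK is a one-flag change. The sketch below follows the SDK's documented pattern; the API key and audio URL are placeholders.

```python
# Sketch: transcription with speaker labels via the AssemblyAI Python SDK.
# Requires: pip install assemblyai
import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder

config = aai.TranscriptionConfig(speaker_labels=True)
transcript = aai.Transcriber().transcribe(
    "https://example.com/meeting.mp3",  # local file path or URL
    config=config,
)

# Each utterance carries a speaker label such as "A" or "B".
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
```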
2. Deepgram
Deepgram's diarization feature emphasizes processing speed. Their latest model shows a 53.1% accuracy improvement over their previous version and operates in a language-agnostic manner.
Key features:
- No limit on number of speakers
- Language-agnostic operation
- Focus on processing speed (10X faster per their benchmarks)
- Integrated with their Nova-2 speech recognition model
Deepgram's speed-first approach makes it suitable for applications where rapid processing is the primary requirement.
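As a rough sketch of Deepgram's REST interface, the request below enables diarization with the `diarize` query parameter. Parameter names and response fields here follow their public documentation but may change between versions, so verify against the current docs; the API key and file name are placeholders.

```python
# Sketch: pre-recorded transcription with diarization via Deepgram's REST API.
# Requires: pip install requests
import requests

API_KEY = "YOUR_DEEPGRAM_API_KEY"  # placeholder

with open("call.wav", "rb") as audio_file:
    response = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-2", "diarize": "true", "punctuate": "true"},
        headers={"Authorization": f"Token {API_KEY}", "Content-Type": "audio/wav"},
        data=audio_file,
    )
response.raise_for_status()

# With diarize enabled, each word object includes a numeric speaker index.
words = response.json()["results"]["channels"][0]["alternatives"][0]["words"]
for word in words[:10]:
    print(word["speaker"], word["word"])
```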
3. Speechmatics
Speechmatics claims to be 25% ahead of their closest competitor in accuracy according to their benchmarks. They offer speaker diarization through their Flow platform with both cloud and on-premise deployment options.
Key features:
- Enhanced accuracy through punctuation-based corrections
- Configurable maximum speakers (2-20)
- Support for 30+ languages
- Processing time increase of 10-50% when diarization is enabled (per their documentation)
Speechmatics provides deployment flexibility for enterprise environments that require on-premise options or have specific compliance needs.
4. Gladia
Gladia combines Whisper's transcription capabilities with PyAnnote's diarization, providing an integrated solution for developers using Whisper. Their enhanced diarization option includes additional processing for edge cases.
Key features:
- Whisper + PyAnnote integration
- Enhanced diarization mode for challenging audio
- Configurable speaker hints
- Streaming support available
This integration provides a pathway for teams already using Whisper to add speaker diarization capabilities without managing multiple services.
5. PyAnnote
PyAnnote is a widely used open-source speaker diarization toolkit, now in version 3.1. It achieves approximately 10% DER with optimized configurations on standard benchmarks and runs at a real-time factor of roughly 2.5% on GPU.
Recent improvements include better handling of overlapping speech and enhanced speaker embeddings. PyAnnote serves as the foundation for several commercial solutions, including Gladia.
Key considerations:
- Requires training or fine-tuning for optimal performance on specific use cases
- Supports Python 3.7+ on Linux and macOS
- Requires Hugging Face authentication token for pre-trained model access
- Active research community and regular updates
PyAnnote is well-suited for research projects and teams with ML expertise who need customizable diarization solutions.
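If you already know how many speakers to expect (or a plausible range), pyannote's pretrained pipeline accepts speaker-count constraints at inference time, which can tighten up the clustering step. The sketch below assumes pyannote.audio 3.x and reuses the same gated model and placeholder token as the earlier example.

```python
# Sketch: constraining the speaker count in pyannote.audio 3.x.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # placeholder
)

# Exact count, if known (e.g., a two-person interview)...
diarization = pipeline("interview.wav", num_speakers=2)

# ...or bounds, if you only know a plausible range.
diarization = pipeline("panel.wav", min_speakers=2, max_speakers=5)
```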
6. NVIDIA NeMo
NVIDIA NeMo introduces Sortformer, an end-to-end diarization approach using an 18-layer Transformer architecture. This innovative design eliminates traditional pipeline stages by treating diarization as a unified problem.
Key features:
- End-to-end neural architecture
- Multi-scale diarization decoder (MSDD)
- Seamless ASR integration
- GPU-optimized processing
The system supports both oracle VAD (using ground-truth timestamps) and system VAD (using model-generated timestamps). NeMo is designed for researchers and teams building custom multi-speaker ASR systems who have access to GPU resources and ML expertise.
7. Kaldi
Kaldi is a speech recognition toolkit widely used in academic research that includes speaker diarization capabilities. It offers extensive customization options and has been the foundation for many speech processing research projects.
With Kaldi, users can either:
- Train models from scratch with full control over the pipeline
- Use the pre-trained x-vector network or PLDA backend available from the Kaldi website
Getting started with Kaldi requires understanding its unique architecture and recipe-based approach. This Kaldi tutorial provides an introduction to the framework. For speaker diarization specifically, this tutorial covers the implementation details.
Kaldi is best suited for academic research and teams with speech processing expertise who need maximum flexibility and control over their diarization pipeline.
8. SpeechBrain
SpeechBrain is a PyTorch-based toolkit offering over 200 recipes for various speech tasks, including speaker diarization. It provides both pre-trained models and training frameworks for researchers and developers.
Key features:
- Extensive recipe collection covering 20+ speech tasks
- PyTorch-based architecture for easy integration
- Modular design allowing component customization
- Active development community and regular updates
The toolkit includes features like dynamic batching, mixed-precision training, and support for single and multi-GPU training. SpeechBrain aims to bridge research and production by providing structured recipes that can be adapted for specific use cases.
It's particularly suitable for teams familiar with PyTorch who want to experiment with different diarization approaches or need to customize their pipeline for specific requirements.
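A common building block for diarization experiments in SpeechBrain is speaker verification, i.e., scoring whether two recordings come from the same voice. The sketch below uses the pretrained ECAPA-TDNN verification model from the SpeechBrain model hub; the file names are placeholders.

```python
# Sketch: same-speaker verification with a pretrained SpeechBrain model.
# Requires: pip install speechbrain torchaudio
# (newer SpeechBrain versions expose this class under speechbrain.inference)
from speechbrain.pretrained import SpeakerRecognition

verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

# Returns a similarity score and a boolean same-speaker decision.
score, same_speaker = verifier.verify_files("utterance_a.wav", "utterance_b.wav")
print(float(score), bool(same_speaker))
```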
Implementation best practices for speaker diarization
Moving from theory to practice requires navigating the nuances of real-world audio. While no AI model is perfect, you can significantly improve your results by following a few best practices. Companies like CallSource, Veed, and Podchaser build world-class products that depend on getting these details right.
- Optimize for your use case: The ideal setup depends on your needs. For batch processing of large audio files, you can afford more processing time to achieve the highest accuracy. For real-time applications, you'll need to balance accuracy with low latency to ensure a smooth user experience.
- Handle real-world audio: Most audio isn't recorded in a studio. Background noise, overlapping speech, and speakers with similar voices are common challenges. Always test models using a sample of your own audio data to see how they perform under the conditions you expect. Models trained on diverse, noisy datasets will consistently outperform those trained only on clean audio, as the best systems are built to handle challenging audio found in real-world environments.
- Debug common errors: When you encounter errors, check for common failure modes. Is the model misidentifying the number of speakers? Are two different speakers being merged into one label? Listen to the specific audio segments where the errors occur. This often reveals issues with audio quality or very short speaker turns that you can address.
- Think in pipelines: Speaker diarization is rarely the final step. It's a critical component in a larger speech understanding pipeline. Once you have accurate speaker labels, you can perform downstream tasks like per-speaker sentiment analysis, topic tracking, or summarization to extract deeper insights, as sketched below.
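As a tiny example of the "pipelines" point, once you have speaker-labeled utterances (from any of the APIs above), downstream analytics can be as simple as aggregating per-speaker statistics. The utterance structure below is a generic assumption, not any particular API's response format.

```python
# Sketch: per-speaker talk time and turn counts from diarized utterances.
from collections import defaultdict

# Assumed generic structure: speaker label plus start/end times in seconds.
utterances = [
    {"speaker": "A", "start": 0.0, "end": 12.4, "text": "Thanks for joining..."},
    {"speaker": "B", "start": 12.6, "end": 20.1, "text": "Happy to be here."},
    {"speaker": "A", "start": 20.3, "end": 35.0, "text": "Let's start with..."},
]

talk_time = defaultdict(float)
turns = defaultdict(int)
for utt in utterances:
    talk_time[utt["speaker"]] += utt["end"] - utt["start"]
    turns[utt["speaker"]] += 1

total = sum(talk_time.values())
for speaker in sorted(talk_time):
    share = 100 * talk_time[speaker] / total
    print(f"Speaker {speaker}: {talk_time[speaker]:.1f}s ({share:.0f}%), {turns[speaker]} turns")
```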
How to choose a speaker diarization solution
For production applications:
- High accuracy in noisy conditions → AssemblyAI's 30% improvement in real-world audio and 43% improvement on short segments (250ms)
- Whisper ecosystem → Gladia provides integrated Whisper + diarization
- Enterprise deployment flexibility → Speechmatics offers both cloud and on-premise options
For conversation intelligence use cases, consider your specific requirements:
- Conference room recordings: Look for solutions tested on noisy, multi-speaker environments
- Call center analytics: Accuracy on brief utterances and speaker count precision are critical
- Meeting transcription: Real-time capabilities and handling of overlapping speech matter
- Interview processing: Clear speaker separation and accurate timestamps are essential
AssemblyAI's recent improvements specifically target these real-world scenarios with documented performance gains in noisy conditions (30% improvement) and very short utterances (43% improvement at 250ms).
For research and development:
- Maximum control → NVIDIA NeMo's Sortformer architecture for custom implementations
- Established framework → PyAnnote 3.1 with pre-trained models and active community
- Academic benchmarking → Kaldi's extensive configuration options
- PyTorch ecosystem → SpeechBrain's recipe collection
Open-source vs managed API decision criteria:
- Development time: APIs offer faster implementation
- Customization needs: Open-source provides model control
- Infrastructure: Self-hosted requires GPU resources
- Maintenance overhead: APIs handle updates automatically
Limitations of speaker diarization
While speaker diarization has improved dramatically, some limitations remain. For real-time applications, diarization is best achieved by using multichannel audio, where each speaker is on a separate channel. This approach provides the clearest speaker separation for live transcription. For pre-recorded files, multichannel transcription can also be an effective alternative to diarization when separate speaker channels are available.
Several factors can still impact the accuracy of diarization models on single-channel audio:
- Speaker talk time
- Conversational dynamics
Speaker talk time has the biggest impact on accuracy. Historically, talk time correlated directly with identification accuracy, and older systems required at least 15-30 seconds of speech per speaker for reliable detection. Modern models like AssemblyAI's Universal-3-Pro, however, have significantly improved performance on short utterances. With features like prompted speaker diarization, the model can accurately attribute even single-word acknowledgments, though more audio data generally still leads to higher confidence.
Conversational dynamics also impact accuracy through specific audio characteristics:
- Turn-taking patterns: Clear speaker transitions improve labeling.
- Overlapping speech: Crosstalk can reduce accuracy and may create phantom speakers, though advanced systems are now better at detecting and labeling these overlapping segments to maintain accuracy.
- Background noise: High noise levels (low SNR) can degrade performance.
- Speaker similarity and rapid turn-taking: Rapid interruptions and very similar voices can challenge clustering algorithms.
These challenging conditions—noisy environments, overlapping speech, and similar voices—are where speaker diarization accuracy varies most between different solutions. Recent advances in the field are specifically targeting these real-world challenges.
For teams building production applications, it's important to evaluate how different solutions perform on your specific audio conditions. Solutions optimized for clean audio may struggle with real-world recordings, while those designed for challenging conditions (like AssemblyAI's 30% improvement in noisy environments) can provide more consistent results across varying audio quality.
Build live transcription solutions with AssemblyAI
AssemblyAI provides production-ready speaker diarization with documented performance improvements in challenging audio conditions. By integrating our speaker diarization and other advanced Voice AI features, organizations gain:
- Proven accuracy: Industry-leading 2.9% speaker count error rate and 30% improvement in noisy conditions based on extensive testing
- Real-world performance: Specific optimizations for challenging audio environments, including 43% improvement on brief utterances (250ms)
- Simple implementation: Straightforward API integration with comprehensive documentation and SDKs
- Comprehensive features: From speaker diarization and custom vocabulary to auto punctuation and confidence scores
- Enterprise security: SOC2 Type 2 certified with enterprise-grade security practices
- Continuous improvements: Regular model updates based on customer feedback and research advances
Test our speaker diarization capabilities with your own audio files in the AssemblyAI API Playground. Sign up for a free account to get started with $50 in credits.
Frequently asked questions about speaker diarization
How do I handle overlapping speech?
Use multi-channel recording for optimal speaker separation, or select end-to-end models with documented crosstalk performance.
What's the minimum audio duration for a speaker to be detected?
Legacy systems typically needed 15-30 seconds of total speech per speaker for reliable detection. Modern models can attribute much shorter segments, though accuracy and confidence still improve as each speaker contributes more audio.
How can I debug common speaker diarization errors?
Check speaker count accuracy, voice similarity merging, and audio quality at error segments. Test with representative audio samples to identify model-specific failure patterns.
What is the difference between speaker diarization and speaker recognition?
Speaker diarization answers the question 'who spoke when?' by clustering voices and assigning generic labels like 'Speaker A' and 'Speaker B'. It does not know the speakers' actual identities. Speaker recognition, on the other hand, identifies a specific person by matching their voice to a pre-existing voice profile, answering the question, 'Is this person John Doe?'



