Python Speech Recognition in 2025

Learn about the different open-source libraries and cloud-based solutions you can use for speech recognition in Python.

If you're looking to implement Automatic Speech Recognition (ASR) in Python, you may have noticed that there is a wide array of available options. Navigating these choices can be daunting, but we’re here to guide you. Broadly, Python speech recognition and Speech-to-Text solutions can be categorized into two main types: open-source libraries and cloud-based services.

Open-source solutions consist of libraries, often hosted on platforms like GitHub, that can be integrated directly into your program. These libraries perform computations locally using your own resources, giving you full control over the process. In contrast, cloud-based solutions handle computation on remote servers and are typically accessed via API endpoints, allowing you to leverage powerful infrastructure without managing it yourself.

In this blog post, we'll show you the different open-source libraries and cloud-based solutions you can use to implement automatic speech recognition in Python. We'll also tell you everything you need to know to make a decision for your use case.

What is Speech Recognition?

Speech recognition is a technology that enables machines to recognize and convert spoken language into text. It works by analyzing audio signals, identifying patterns, and matching them to words and phrases using advanced algorithms.

Modern speech recognition systems often leverage machine learning and artificial intelligence, allowing them to handle various accents, languages, and speaking styles with impressive accuracy.

This technology is widely used in virtual assistants, transcription tools, conversational intelligence apps (which for example can extract meeting insights or provide sales and customer insights), customer service chatbots, and voice-controlled devices. By transforming spoken words into actionable digital input, speech recognition makes interactions with technology more intuitive and productive.

Open-Source vs. Cloud-Based Python Speech Recognition Solutions

When evaluating Python Speech Recognition options, it’s essential to consider the trade-offs between open-source and cloud-based solutions.

Open-Source Solutions

One of the primary advantages of open-source speech recognition solutions is that the source code is openly available, so you can inspect every aspect of how the system works: what it does, how it does it, and when it executes certain functions. For advanced developers, open-source solutions also offer the flexibility to modify the code to meet specific requirements.

However, open-source solutions come with notable challenges. The computational resources required for speech recognition and the engineering infrastructure to make sure the service is reliable and fast must be provided and managed by you, either through local hardware or self-hosted cloud infrastructure. For individual developers or smaller teams without access to significant resources, this can become a considerable burden. Additionally, the accuracy of open-source tools is often inferior to that of cloud-based alternatives, which may be a critical limitation for projects where precision is a priority.

Cloud-Based Solutions

Cloud-based solutions, on the other hand, offer several distinct advantages. They can be more accurate than open-source alternatives, easier to implement, and eliminate the need to host models or manage computational resources. Many cloud platforms, such as AssemblyAI's Speech-to-Text API, provide advanced features like customizable vocabularies, paragraph detection, and speaker diarization. Some even offer free-tier options for basic usage, making them accessible to developers with limited budgets.

The primary drawbacks of cloud-based solutions are their cost and the lack of control over the underlying infrastructure and algorithms, as they are managed by the service provider. However, these drawbacks are often outweighed by the ease of implementation, higher accuracy, and reduced need for managing computational resources, making cloud-based solutions a popular choice for many developers.

Key Takeaways

When choosing between open-source and cloud-based speech recognition solutions for your Python project, consider four main factors:

  1. Accuracy: For high precision, cloud solutions are often better. However, modern open-source tools are also highly accurate, so we suggest comparing models using the latest benchmarks.
  2. Cost: Open-source tools can be more cost-effective for those with access to self-managed computational resources. However, it’s important to consider hidden costs, such as the need for high-performance hardware, potential cloud infrastructure expenses, and the time and expertise required for setup, maintenance, and customization. On the other hand, cloud-based services come with ongoing expenses, but many providers offer pay-as-you-go plans, allowing you to pay only for what you use.
  3. Ease of implementation: Cloud solutions are typically much simpler to integrate and deploy compared to open-source alternatives. While some open-source models may be easy to run, deploying them to ensure they are available, reliable, and performant is a more complex task.
  4. Full control and transparency: Open-source tools provide complete control and visibility into the code, allowing for advanced customization and ensuring compliance with specific requirements or policies.

By carefully weighing these factors, you can select the approach that best aligns with your project’s needs and constraints.

Open-Source Python Speech Recognition Options

There are many open-source Python speech recognition options. We’ll cover the four most common ones here: Whisper, SpeechRecognition, wav2letter, and DeepSpeech.

OpenAI Whisper

Whisper, developed by OpenAI, is a versatile speech recognition model capable of tasks like transcription, multilingual processing, and handling noisy audio. It’s designed to work locally on your machine, providing flexibility without relying on cloud services.

Whisper supports multiple pre-trained models, ranging from lightweight options suitable for CPUs to larger models optimized for GPUs. Installation is simple via pip, and its robust handling of low-quality audio makes it a strong choice for challenging environments.

However, Whisper’s local processing requires significant computational resources, especially for larger models or real-time applications. While smaller models can run on CPUs, GPUs are all but necessary for serious workloads.

Whisper’s biggest issues are its increased propensity for hallucinations and some notable weaknesses in real-world use cases, such as proper noun detection, when compared to Universal-2. Still, it is a strong choice for developers seeking an advanced, offline-capable speech recognition solution that supports multilingual tasks and can handle diverse audio conditions. It’s a powerful tool, but best suited for those with the resources and expertise to self-host and deploy it effectively for non-personal use.

Below is a short code example that shows how to use Whisper. If you want to learn more, you can read this blog post on how to run OpenAI’s Whisper model.

# pip install openai-whisper

import whisper

model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3")
print(result["text"])
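
Whisper ships in several model sizes (such as tiny, base, small, medium, large, and turbo), letting you trade accuracy for speed depending on your hardware. As a variation on the snippet above, the sketch below loads a smaller model on the CPU and disables half-precision inference, which is generally what you want without a GPU; the parameters shown are standard whisper options, but treat the exact values as illustrative.

import whisper

# Load a smaller model explicitly on the CPU
model = whisper.load_model("base", device="cpu")

# fp16 only helps on GPUs; setting the language skips auto-detection
result = model.transcribe("audio.mp3", fp16=False, language="en")
print(result["text"])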

SpeechRecognition

At first glance, the name SpeechRecognition suggests it might be the ultimate open-source library for speech recognition. However, it’s important to note that this library is not a standalone solution but rather a wrapper for other speech recognition technologies. While it does support numerous services, including Google Cloud Speech-to-Text, CMU Sphinx, Wit.ai, Azure, Houndify, IBM Watson, and Snowboy, most of these are cloud-based. Options that can also be used offline are CMU Sphinx, Vosk, and Whisper.

A solid option within this library is CMU Sphinx, although its creators state that active development has largely ceased, so it has fallen well behind the state of the art. The strengths of CMU Sphinx lie in its robustness and history: it is built on over 20 years of research at Carnegie Mellon University, one of the leading institutions in computer science. Sphinx is relatively lightweight compared to other Speech-to-Text solutions, supports multiple languages, and offers extensive developer documentation and FAQs.

However, these strengths highlight a limitation of the SpeechRecognition library itself. It functions as a wrapper that relies on external speech recognition services to perform transcriptions, rather than offering its own native capabilities. While this integration simplifies access to multiple services within a single library, it doesn’t provide the flexibility or independence of a fully open-source solution.

This library is a practical choice if you’re experimenting with multiple cloud-based speech-to-text services and want an easy way to test their performance within a unified framework. However, if you need a robust offline solution or more advanced customization, you may want to explore other libraries tailored to your specific needs.

Here is a code example that shows how to transcribe an audio file with the SpeechRecognition library:

# pip install SpeechRecognition
# pip install pocketsphinx  # required for the offline Sphinx recognizer below

import speech_recognition as sr
from os import path

AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "english.wav")

r = sr.Recognizer()

with sr.AudioFile(AUDIO_FILE) as source:
    audio = r.record(source)

# recognize speech using Sphinx
try:
    print("Sphinx thinks you said " + r.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("Sphinx could not understand audio")
except sr.RequestError as e:
    print("Sphinx error; {0}".format(e))

# recognize speech using Google Speech Recognition
try:
    print("Google Speech Recognition thinks you said " + r.recognize_google(audio))
except sr.UnknownValueError:
    print("Google Speech Recognition could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Google Speech Recognition service; {0}".format(e))

Wav2letter

The open-source library wav2letter, initially developed by Facebook AI Research, has since been moved and consolidated into a new open-source library called Flashlight. Despite this, it remains widely recognized by its original name, wav2letter.

What sets wav2letter apart is its unique architecture. Unlike many natural language processing (NLP) models, which were historically dominated by recurrent neural networks (RNNs) and, more recently, transformers, wav2letter is designed entirely using convolutional neural networks (CNNs). This innovative approach spans both acoustic modeling and language modeling, making it a distinctive option in the field of speech recognition.

However, wav2letter does come with challenges that may deter less experienced developers. Unlike many Python libraries that can be installed with a simple pip command, wav2letter requires manual building. This involves the use of a C++ compiler, adding a layer of complexity to the installation process. Additionally, since its transition to Flashlight, the Flashlight library is now a required dependency, further complicating setup. But if you don’t mind the hands-on effort to configure it, you can start by reading this guide on how to install wav2letter.

DeepSpeech

DeepSpeech is an open-source Speech-to-Text engine from Mozilla that uses a model trained with machine learning techniques based on Baidu's Deep Speech research paper, offering a powerful and flexible speech recognition library. One of DeepSpeech’s standout features is its ability to run offline and on a wide range of devices, from resource-constrained hardware like the Raspberry Pi to high-performance GPUs commonly used in model training.

Getting started with DeepSpeech is relatively straightforward. The library can be installed using pip, and audio files can be processed with minimal setup, as shown in the provided documentation or code snippets. Additionally, while DeepSpeech offers pre-trained English models, users are not restricted to them. You can use custom models tailored to your specific language.

However, there are key considerations when using DeepSpeech. Since it operates on-device, it requires significant local computational resources, particularly for training or running models at scale. If you intend to train models on a GPU, you must also have the required CUDA dependencies.

DeepSpeech is an excellent choice for developers seeking to perform speech recognition locally without relying on cloud services. However, it is best suited for advanced programmers who are comfortable with customization and have the technical expertise to manage dependencies and optimize performance for on-device usage. For those ready to dive deeper into speech recognition, DeepSpeech offers a robust platform for building sophisticated, offline-capable applications.

To install and use DeepSpeech you can use these commands (see documentation):

# Install DeepSpeech
pip install deepspeech

# Download pre-trained English model files
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer

# Download example audio files
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/audio-0.9.3.tar.gz
tar xvf audio-0.9.3.tar.gz

# Transcribe one of the example audio files
deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio audio/2830-3980-0043.wav
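
The same models can also be used directly from Python. Here is a minimal sketch based on the deepspeech 0.9.x Python API (Model, enableExternalScorer, and stt); it assumes a 16 kHz, 16-bit mono WAV file such as the samples shipped in the audio archive above.

# pip install deepspeech numpy

import wave
import numpy as np
import deepspeech

# Load the acoustic model and the optional external scorer
model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

# DeepSpeech expects 16 kHz, 16-bit mono PCM audio
with wave.open("audio/2830-3980-0043.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(model.stt(audio))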

Cloud Python Speech Recognition

AssemblyAI offers a fast, cloud-hosted speech recognition API that is free for developers to use and packed with features. In this section, we’ll explore how to use the AssemblyAI API for transcription, speaker diarization, transcribing different languages, and extracting sentences and paragraphs from transcribed text.

To follow along with this tutorial, you’ll need a free API key from AssemblyAI. Once you sign up, your API key will be accessible in your account settings.

Using AssemblyAI’s Speech-to-Text API

The AssemblyAI Speech-to-Text API is known for its speed, accuracy, and ease of use. The API currently serves Universal-2 as the default model, its most accurate model for industry use cases.

It also offers strong data privacy and security measures, making it a reliable choice for developers. The simplest way to integrate it into your workflow is by using the AssemblyAI Python SDK.

# pip install assemblyai

import assemblyai as aai
aai.settings.api_key = "YOUR_API_KEY"

transcriber = aai.Transcriber()

# You can use a local filepath:
# audio_file = "./example.mp3"

# Or use a publicly-accessible URL:
audio_file = "https://assembly.ai/sports_injuries.mp3"

transcript = transcriber.transcribe(audio_file)
print(transcript.text)

To transcribe audio using AssemblyAI’s Python SDK, start by installing the SDK with pip install assemblyai. Once installed, import the module and set your API key by assigning it to aai.settings.api_key, ensuring you replace YOUR_API_KEY with your actual key from AssemblyAI. Next, create an instance of the Transcriber class, which will handle the transcription process.

You can provide either a local audio file or a publicly accessible URL as the input. In this example, a sample URL is used for simplicity. The transcribe method of the Transcriber object processes the audio and returns the transcription, which is stored in a variable called transcript. Finally, you can print the text of the transcription using print(transcript.text).

This streamlined approach allows you to convert audio to text quickly, with minimal setup and just a few lines of code.
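
In production you’ll also want to handle failed jobs. The SDK exposes the job status and error message on the transcript object, so a small check like the one below (following the pattern from AssemblyAI’s docs) keeps failures from going unnoticed:

transcript = transcriber.transcribe(audio_file)

# Check whether the transcription job failed before using the text
if transcript.status == aai.TranscriptStatus.error:
    print(f"Transcription failed: {transcript.error}")
else:
    print(transcript.text)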

Speaker Diarization with AssemblyAI’s Speech Recognition API

Speaker Diarization can be included in your transcription by enabling an additional API parameter. This feature allows the model to identify multiple speakers in an audio file and attribute specific portions of the transcription to each speaker.

To activate Speaker Diarization, create a TranscriptionConfig object and set the speaker_labels parameter to True. This ensures that the transcript includes speaker-specific annotations, making it easier to analyze conversations or multi-speaker audio.

config = aai.TranscriptionConfig(speaker_labels=True)

transcript = aai.Transcriber().transcribe(audio_file, config)

for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")

If you enable Speaker Diarization, the resulting transcript contains a list of utterances, where each utterance corresponds to an uninterrupted segment of speech from a single speaker. Here is an example from a transcribed podcast episode:

Speaker A: Welcome to the Huberman Lab podcast, where we discuss science and science based tools for everyday...
Speaker B: I'm glad to be here. It's amazing.
Speaker A: I'm a time consumer of your content. I've learned a tremendous amount about fitness, both in the...
Speaker B: I think it's like a 60 40 split, which would be leaning towards weight training strength, and then the...
Speaker A: And in terms of the duration of those workouts, what's your suggestion? I've been weight training...
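
If you already know how many people are speaking in the recording, you can optionally pass that number as a hint. A short sketch, assuming the speakers_expected parameter of TranscriptionConfig:

# Hint that the recording contains exactly two speakers (optional)
config = aai.TranscriptionConfig(speaker_labels=True, speakers_expected=2)

transcript = aai.Transcriber().transcribe(audio_file, config)
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")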

Speech Recognition for multiple languages in Python with AssemblyAI

AssemblyAI’s API supports speech recognition for 99 languages. English is transcribed by default; other languages can be transcribed either by using the automatic language detection feature or by setting the language code manually.

With automatic language detection, AssemblyAI identifies the dominant language spoken in the audio file and uses it during the transcription. To enable it, set language_detection to True in the transcription config.

config = aai.TranscriptionConfig(language_detection=True)

transcript = aai.Transcriber().transcribe(audio_file, config)
print(transcript.text)
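
If you want to know which language was detected, you can inspect the raw API response attached to the transcript. A small sketch, assuming the SDK’s json_response attribute:

config = aai.TranscriptionConfig(language_detection=True)
transcript = aai.Transcriber().transcribe(audio_file, config)

# The raw response includes the detected language code, e.g. "es"
print(transcript.json_response.get("language_code"))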

Alternatively, if you already know the dominant language, you can use the language_code key to specify the language of the speech in your audio file.

config = aai.TranscriptionConfig(language_code="es")

transcript = aai.Transcriber().transcribe(audio_file, config)
print(transcript.text)

Getting Paragraphs and Sentences from AssemblyAI’s Speech Recognition API

AssemblyAI transcripts can be automatically segmented into paragraphs or sentences, creating a more reader-friendly format. The text of the transcript is broken down into either paragraphs or sentences, along with additional metadata such as start and end timestamps or speaker information.

To retrieve a list of all sentences and paragraphs, call the get_sentences() or get_paragraphs() function on the retrieved transcript object:

sentences = transcript.get_sentences()
for sentence in sentences:
    print(f"{sentence.start}: {sentence.text}")

paragraphs = transcript.get_paragraphs()
for paragraph in paragraphs:
    print(f"{paragraph.start}: {paragraph.text}")

Real-Time Speech Recognition in Python

AssemblyAI’s Speech-to-Text API also offers real-time streaming capabilities, enabling the transcription of live audio streams with high accuracy and low latency. This feature is particularly valuable for applications such as live captioning, real-time meeting transcription, conversational analytics, and voice-controlled systems.

By streaming audio data to AssemblyAI’s secure WebSocket API, you can receive transcription results within just a few hundred milliseconds, making it ideal for use cases where immediate feedback is crucial. This functionality also supports advanced features such as speaker diarization, custom vocabulary, and end-of-utterance controls, ensuring even greater flexibility and precision in live transcription scenarios.

Below is a minimal code example that shows how to transcribe streaming audio from a microphone in Python. For a detailed walkthrough, see the documentation. Please note that Streaming is currently not available in the free tier and users have to set up billing to access it.

# pip install "assemblyai[extras]"  # the extras are required for microphone streaming

import assemblyai as aai
aai.settings.api_key = "YOUR_API_KEY"

def on_data(transcript: aai.RealtimeTranscript):
    if not transcript.text: return
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print(transcript.text, end="\r\n")
    else:
        print(transcript.text, end="\r")
        
def on_error(error: aai.RealtimeError):
    print("An error occured:", error)
    
transcriber = aai.RealtimeTranscriber(
    sample_rate=16_000,
    on_data=on_data,
    on_error=on_error,
)

transcriber.connect()

microphone_stream = aai.extras.MicrophoneStream(sample_rate=16_000)
transcriber.stream(microphone_stream)

transcriber.close()

A Summary of the State of Python Speech Recognition in 2025

In 2025, Python offers a diverse landscape of speech recognition solutions. In this blog post, we presented several open-source solutions as well as the AssemblyAI API for a cloud-based solution.

Among open-source libraries, Whisper stands out for its accuracy and usability, while DeepSpeech offers robust offline capabilities for advanced developers willing to invest in customization. wav2letter (now part of Flashlight) appeals to those intrigued by convolutional neural network-based architectures but comes with significant setup challenges. SpeechRecognition, though widely used and great for experimentation, is more of a wrapper for other technologies and lacks true standalone functionality beyond limited offline options.

For cloud-based solutions, AssemblyAI delivers a fast, accurate, and feature-rich Speech-to-Text API. With capabilities like multi-language speech recognition support, speaker diarization, and real-time streaming, it simplifies complex transcription workflows with an easy-to-use Python SDK.

Ultimately, the choice between open-source and cloud-based options depends on your specific requirements for cost, accuracy, customization, and ease of implementation. Whether you’re building lightweight prototypes or scaling advanced production applications, Python continues to provide excellent tools for integrating speech recognition into your projects.

If you want more guidance on selecting a Python speech recognition library, read these blog posts about How to Choose the Best Speech-to-Text API and The top free Speech-to-Text APIs, AI Models, and Open Source Engines.