Build & Learn
September 4, 2025

Real-time transcription in Python with Universal-Streaming

Learn how to build real-time voice applications with AssemblyAI's Universal-Streaming model.

Ryan O'Connor
Senior Developer Educator

Real-time transcription allows you to transcribe audio as it is generated, rather than submitting a complete audio file for transcription as with asynchronous transcription. Using Universal-Streaming, you can build voice agents, automated subtitles for live speeches, real-time meeting transcription, and interactive voice applications with industry-leading accuracy and ~300ms latency (performance benchmarks show this is up to 41% faster than some competing solutions).

In this tutorial, we'll learn how to perform real-time transcription in Python using AssemblyAI's Universal-Streaming model. We'll cover the complete implementation process, from environment setup through advanced configuration, along with troubleshooting tips to help you build production-ready voice applications.

What is real-time transcription and when to use it

Real-time transcription processes audio as it's spoken, converting it to text with minimal delay. This is different from asynchronous, or batch, transcription where you submit a complete audio file and wait for the full transcript to be returned.

Choose real-time transcription when your application depends on low-latency feedback. For example, voice agents need to understand and respond to a user immediately to maintain a natural conversational flow; industry adoption trends show businesses rapidly embracing this technology to improve efficiency and customer experience. Other common applications include live captioning for events, real-time meeting notes, and interactive voice response (IVR) systems.

Getting started

For this tutorial, we'll be using AssemblyAI's Universal-Streaming model, which delivers immutable transcripts with ~300ms latency and intelligent endpointing designed specifically for voice agents.

You'll need an API key, so get one for free here if you don't already have one.

Universal-Streaming uses session-based pricing at $0.15/hour. Free accounts are limited to 5 concurrent sessions, while paid accounts default to 100, a limit that automatically scales with usage.

Setting up the virtual environment

We'll use the AssemblyAI Python SDK for this tutorial, which provides high-level functions for interacting with Universal-Streaming. To install it, first create a directory and virtual environment for this project:

mkdir universal-streaming-demo && cd universal-streaming-demo
python -m venv venv

Next, activate the virtual environment.

On macOS/Linux:

source ./venv/bin/activate

On Windows:

.\venv\Scripts\activate.bat

Install the portaudio system dependency, which the SDK's microphone extras rely on. On macOS:

brew install portaudio

On Debian/Ubuntu Linux:

sudo apt install portaudio19-dev

Now, install the SDK and the additional extras:

pip install "assemblyai[extras]"

The extras contain additional packages for real-time transcription functionality, like getting the audio stream from the microphone.

Setting up the environment file

The AssemblyAI Python SDK requires your API key to be stored in an environment variable called ASSEMBLYAI_API_KEY. Create a file called .env in your project directory and add your API key:

ASSEMBLYAI_API_KEY=your-key-here


Important: Never share this file or check it into source control. Create a .gitignore file to prevent accidental commits:

.env
venv

How to perform real-time transcription with Universal-Streaming

Universal-Streaming uses WebSocket connections to provide ultra-fast, immutable transcripts. Unlike traditional streaming models that provide partial and final transcripts, Universal-Streaming delivers immutable transcripts that won't change once emitted, making them immediately ready for downstream processing in voice agents.

Understanding Universal-Streaming responses

Universal-Streaming uses Turn objects for immutable transcriptions. Each Turn represents a single speaking turn with these properties:

  • turn_order: Integer that increments with each new turn
  • transcript: String containing only finalized words
  • end_of_turn: Boolean indicating if this is the end of the current turn
  • turn_is_formatted: Boolean indicating if the text includes punctuation and formatting
  • end_of_turn_confidence: Float (0-1) representing confidence that the turn has finished
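To illustrate how these fields fit together, here is a minimal sketch of a turn handler. The `Turn` dataclass below is a stand-in for the SDK's real `TurnEvent` (its fields mirror the list above), and `handle_turn` is an illustrative helper, not part of the SDK:

```python
from dataclasses import dataclass

# Stand-in for the SDK's TurnEvent, using the fields described above.
@dataclass
class Turn:
    turn_order: int
    transcript: str
    end_of_turn: bool
    turn_is_formatted: bool
    end_of_turn_confidence: float

def handle_turn(turn: Turn, completed: list[str]) -> None:
    """Keep a transcript only once the model signals the turn is over."""
    if turn.end_of_turn:
        completed.append(turn.transcript)

completed: list[str] = []
# Mid-turn update: transcript is still growing, so nothing is stored yet.
handle_turn(Turn(0, "hello there", False, False, 0.2), completed)
# End of turn: the finalized, formatted transcript is stored.
handle_turn(Turn(0, "Hello there, friend.", True, True, 0.95), completed)
print(completed)
```

Because transcripts are immutable, a buffer like `completed` never needs to be revised after the fact, which is what makes the output safe to hand straight to downstream logic.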

Event handlers

We need to define event handlers for different types of events during the streaming session.

Create a file called main.py and add the following imports and event handlers:

import assemblyai as aai
from typing import Type
from dotenv import load_dotenv
import os
from assemblyai.streaming.v3 import (
   BeginEvent,
   StreamingClient,
   StreamingClientOptions,
   StreamingError,
   StreamingEvents,
   StreamingParameters,
   StreamingSessionParameters,
   TerminationEvent,
   TurnEvent,
)

load_dotenv()

api_key = os.getenv('ASSEMBLYAI_API_KEY')

def on_begin(self: Type[StreamingClient], event: BeginEvent):
   print(f"Session started: {event.id}")

def on_turn(self: Type[StreamingClient], event: TurnEvent):
   print(f"{event.transcript} ({event.end_of_turn})")

   if event.end_of_turn and not event.turn_is_formatted:
       params = StreamingSessionParameters(
           format_turns=True,
       )
       self.set_params(params)

def on_terminated(self: Type[StreamingClient], event: TerminationEvent):
   print(
       f"Session terminated: {event.audio_duration_seconds} seconds of audio processed"
   )

def on_error(self: Type[StreamingClient], error: StreamingError):
   print(f"Error occurred: {error}")

Create and run the streaming client

Now add the main script code to create and run the Universal-Streaming client:

def main():
   client = StreamingClient(
       StreamingClientOptions(
           api_key=api_key,
           api_host="streaming.assemblyai.com"
       )
   )

   client.on(StreamingEvents.Begin, on_begin)
   client.on(StreamingEvents.Turn, on_turn)
   client.on(StreamingEvents.Termination, on_terminated)
   client.on(StreamingEvents.Error, on_error)

   client.connect(
       StreamingParameters(
           sample_rate=16000,
           format_turns=True,
       )
   )

   try:
       client.stream(
           aai.extras.MicrophoneStream(sample_rate=16000)
       )
   finally:
       client.disconnect(terminate=True)

if __name__ == "__main__":
   main()

Running the script

With your virtual environment activated, run the script:

python main.py

You'll see your session ID printed when the connection starts. Expected behavior:

  • Immutable transcripts appear in real-time as you speak
  • Final transcripts include punctuation and formatting after speech ends
  • Press Ctrl+C to terminate the session

Advanced configuration options

Universal-Streaming offers several configuration options to optimize for your specific use case:

Intelligent endpointing

Configure end-of-turn detection to handle natural conversation flows:

client.connect(
   StreamingParameters(
       sample_rate=16000,
       end_of_turn_confidence_threshold=0.8,
       min_end_of_turn_silence_when_confident=500, # milliseconds
       max_turn_silence=2000,  # milliseconds
   )
)

Text formatting control

Control whether you receive formatted transcripts:

client.connect(
   StreamingParameters(
       sample_rate=16000,
       format_turns=True
   )
)

Authentication tokens

For client-side applications, use temporary authentication tokens to avoid exposing your API key. This is a critical security measure, as a recent report found that over 30% of product leaders see data privacy as a significant challenge when implementing speech recognition. First, on the server-side, use your API key to generate the temporary token:

# Generate a temporary token (do this on your server)
client = StreamingClient(
   StreamingClientOptions(
       api_key=api_key,
       api_host="streaming.assemblyai.com"
   )
)

token = client.create_temporary_token(expires_in_seconds=60, max_session_duration_seconds=3600)

Then on the client-side, initialize the StreamingClient with the token parameter instead of the API key:

client = StreamingClient(
   StreamingClientOptions(
       token=token,
       api_host="streaming.assemblyai.com"
   )
)
Troubleshooting common implementation issues

When working with real-time audio streaming, you might encounter a few common issues. Here's how to handle them.

  • WebSocket Connection Errors: The SDK raises a StreamingError with specific codes. Common codes include:
    • 1008: Unauthorized connection. This can be due to an invalid API key, insufficient account balance, or exceeding your concurrency limits.
    • 3005: Session expired. This can happen if the maximum session duration is exceeded or if audio is sent faster than real-time.
    Refer to the documentation for a full list of error codes.
  • Incorrect Audio Format: The Universal-Streaming model expects a specific audio format and sample rate. Ensure you are streaming audio with a sample rate of at least 16000Hz. Mismatched sample rates can lead to poor transcription accuracy.
  • Handling Network Interruptions: Network instability can disrupt the audio stream. Your application should include logic to catch connection errors and attempt to reconnect. The Python SDK handles some of this automatically, but for production systems, building a resilient reconnection strategy is a good practice.
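A reconnection strategy can be sketched generically. In the snippet below, `stream_session` stands in for whatever callable runs one streaming session (in a real app it would wrap `client.stream(...)` and you would catch the SDK's `StreamingError`); the retry-with-backoff logic is the part that carries over:

```python
import time

def stream_with_reconnect(start_session, max_retries=3, base_delay=1.0):
    """Retry a streaming session with exponential backoff.

    start_session is any callable that runs one streaming session and
    raises on a connection failure. In production you would catch the
    SDK's StreamingError here instead of ConnectionError.
    """
    for attempt in range(max_retries + 1):
        try:
            return start_session()
        except ConnectionError as exc:
            if attempt == max_retries:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Connection lost ({exc}); retrying in {delay:.2f}s")
            time.sleep(delay)

# Demo: a session that fails twice, then succeeds.
attempts = {"n": 0}
def flaky_session():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("socket closed")
    return "session complete"

result = stream_with_reconnect(flaky_session, base_delay=0.01)
```

Exponential backoff avoids hammering the server during an outage while still recovering quickly from brief network blips.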

Complete example

Here's the complete working example:

import assemblyai as aai
from typing import Type
from dotenv import load_dotenv
import os
from assemblyai.streaming.v3 import (
   BeginEvent,
   StreamingClient,
   StreamingClientOptions,
   StreamingError,
   StreamingEvents,
   StreamingParameters,
   StreamingSessionParameters,
   TerminationEvent,
   TurnEvent,
)

load_dotenv()

api_key = os.getenv('ASSEMBLYAI_API_KEY')

def on_begin(self: Type[StreamingClient], event: BeginEvent):
   print(f"Session started: {event.id}")

def on_turn(self: Type[StreamingClient], event: TurnEvent):
   print(f"{event.transcript} ({event.end_of_turn})")

   if event.end_of_turn and not event.turn_is_formatted:
       params = StreamingSessionParameters(
           format_turns=True,
       )
       self.set_params(params)

def on_terminated(self: Type[StreamingClient], event: TerminationEvent):
   print(
       f"Session terminated: {event.audio_duration_seconds} seconds of audio processed"
   )

def on_error(self: Type[StreamingClient], error: StreamingError):
   print(f"Error occurred: {error}")

def main():
   client = StreamingClient(
       StreamingClientOptions(
           api_key=api_key,
           api_host="streaming.assemblyai.com"
       )
   )

   client.on(StreamingEvents.Begin, on_begin)
   client.on(StreamingEvents.Turn, on_turn)
   client.on(StreamingEvents.Termination, on_terminated)
   client.on(StreamingEvents.Error, on_error)

   client.connect(
       StreamingParameters(
           sample_rate=16000,
           format_turns=True,
       )
   )

   try:
       client.stream(
           aai.extras.MicrophoneStream(sample_rate=16000)
       )
   finally:
       client.disconnect(terminate=True)

if __name__ == "__main__":
   main()

Best practices for Universal-Streaming

To get the best results from Universal-Streaming:

  • Audio Configuration: Use ≥16kHz sample rates for optimal accuracy
  • Connection Management: Maintain persistent WebSocket connections to minimize latency
  • Processing Optimization: Use format_turns=False for faster voice agent processing
  • Error Handling: Implement StreamingError exception handling with reconnection logic
  • Security: Generate temporary tokens server-side for client applications

Use cases for Universal-Streaming

Universal-Streaming is designed for applications that need ultra-fast, accurate speech recognition:

  • Voice agents: Build conversational AI with natural turn-taking
  • Live captioning: Provide real-time subtitles for meetings and events
  • Voice assistants: Create responsive voice interfaces
  • Call center analytics: Analyze customer conversations in real-time. This is a significant market, as industry analysis shows the financial services sector alone accounts for over $100 billion in annual contact center spending.
  • Meeting transcription: Document discussions as they happen

Getting started with real-time Speech AI

You now have the tools to build responsive, accurate voice applications using Python and AssemblyAI's Universal-Streaming model. With immutable transcripts delivered at ~300ms latency and intelligent conversational turn detection, you can focus on creating a great user experience instead of managing complex speech-to-text infrastructure.

The combination of performance, accuracy, and developer-friendly features makes Universal-Streaming a solid foundation for any real-time Speech AI project. In fact, a survey of tech leaders confirms that performance, accuracy, and ease of use are among the most important factors when choosing an AI vendor. To explore further, check out the official documentation or start building your first application. Try our API for free.

Frequently asked questions about real-time transcription implementation

How do I handle network interruptions during a streaming session?

Wrap streaming logic in try/except blocks to catch StreamingError exceptions and implement custom reconnection strategies.

What audio formats and sample rates provide the best accuracy?

Use 16kHz+ sample rate with single-channel (mono) linear16 PCM encoding.

Can I stream audio from a file instead of a microphone?

Yes, replace aai.extras.MicrophoneStream with a custom function that reads audio files in chunks.
