Universal Streaming

By default, Universal-Streaming transcribes English audio. For multilingual streaming (support for English, Spanish, French, German, Italian, and Portuguese), enable multilingual transcription instead.

Streaming is now available in EU-West via streaming.eu.assemblyai.com. To use the EU streaming endpoint, replace streaming.assemblyai.com with streaming.eu.assemblyai.com in your connection configuration.
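As a minimal sketch, the endpoint URL used in the quickstart below can be pointed at the EU-West region by swapping only the host (the connection parameters here mirror the quickstart and are otherwise unchanged):

```python
from urllib.parse import urlencode

# Same connection parameters as the quickstart below
CONNECTION_PARAMS = {"sample_rate": 16000, "format_turns": True}

# Default host streaming.assemblyai.com, swapped for the EU-West host
API_ENDPOINT_BASE_URL = "wss://streaming.eu.assemblyai.com/v3/ws"
API_ENDPOINT = f"{API_ENDPOINT_BASE_URL}?{urlencode(CONNECTION_PARAMS)}"

print(API_ENDPOINT)
```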

Quickstart

In this quick guide you will learn how to use AssemblyAI’s Streaming Speech-to-Text feature to transcribe audio from your microphone.

To run this quickstart you will need:

  • Python installed
  • A valid AssemblyAI API key

To run the quickstart:

1. Create a new Python file (for example, main.py) and paste the code provided below inside.

2. Insert your API key on the line that defines YOUR_API_KEY.

3. Install the necessary libraries:

   $ pip install websocket-client pyaudio

4. Run with python main.py.

import pyaudio
import websocket
import json
import threading
import time
import wave
from urllib.parse import urlencode
from datetime import datetime

# --- Configuration ---
YOUR_API_KEY = "YOUR-API-KEY"  # Replace with your actual API key

CONNECTION_PARAMS = {
    "sample_rate": 16000,
    "format_turns": True,  # Request formatted final transcripts
}
API_ENDPOINT_BASE_URL = "wss://streaming.assemblyai.com/v3/ws"
API_ENDPOINT = f"{API_ENDPOINT_BASE_URL}?{urlencode(CONNECTION_PARAMS)}"

# Audio Configuration
FRAMES_PER_BUFFER = 800  # 50ms of audio (0.05s * 16000Hz)
SAMPLE_RATE = CONNECTION_PARAMS["sample_rate"]
CHANNELS = 1
FORMAT = pyaudio.paInt16

# Global variables for audio stream and websocket
audio = None
stream = None
ws_app = None
audio_thread = None
stop_event = threading.Event()  # To signal the audio thread to stop

# WAV recording variables
recorded_frames = []  # Store audio frames for WAV file
recording_lock = threading.Lock()  # Thread-safe access to recorded_frames

# --- WebSocket Event Handlers ---


def on_open(ws):
    """Called when the WebSocket connection is established."""
    print("WebSocket connection opened.")
    print(f"Connected to: {API_ENDPOINT}")

    # Start sending audio data in a separate thread
    def stream_audio():
        global stream
        print("Starting audio streaming...")
        while not stop_event.is_set():
            try:
                audio_data = stream.read(FRAMES_PER_BUFFER, exception_on_overflow=False)

                # Store audio data for WAV recording
                with recording_lock:
                    recorded_frames.append(audio_data)

                # Send audio data as binary message
                ws.send(audio_data, websocket.ABNF.OPCODE_BINARY)
            except Exception as e:
                print(f"Error streaming audio: {e}")
                # If stream read fails, it's likely closed; stop the loop
                break
        print("Audio streaming stopped.")

    global audio_thread
    audio_thread = threading.Thread(target=stream_audio)
    audio_thread.daemon = (
        True  # Allow main thread to exit even if this thread is running
    )
    audio_thread.start()


def on_message(ws, message):
    try:
        data = json.loads(message)
        msg_type = data.get('type')

        if msg_type == "Begin":
            session_id = data.get('id')
            expires_at = data.get('expires_at')
            print(f"\nSession began: ID={session_id}, ExpiresAt={datetime.fromtimestamp(expires_at)}")
        elif msg_type == "Turn":
            transcript = data.get('transcript', '')
            if data.get('end_of_turn'):
                print('\r' + ' ' * 80 + '\r', end='')
                print(transcript)
            else:
                print(f"\r{transcript}", end='')
        elif msg_type == "Termination":
            audio_duration = data.get('audio_duration_seconds', 0)
            session_duration = data.get('session_duration_seconds', 0)
            print(f"\nSession Terminated: Audio Duration={audio_duration}s, Session Duration={session_duration}s")
    except json.JSONDecodeError as e:
        print(f"Error decoding message: {e}")
    except Exception as e:
        print(f"Error handling message: {e}")


def on_error(ws, error):
    """Called when a WebSocket error occurs."""
    print(f"\nWebSocket Error: {error}")
    # Attempt to signal stop on error
    stop_event.set()


def on_close(ws, close_status_code, close_msg):
    """Called when the WebSocket connection is closed."""
    print(f"\nWebSocket Disconnected: Status={close_status_code}, Msg={close_msg}")

    # Save recorded audio to WAV file
    save_wav_file()

    # Ensure audio resources are released
    global stream, audio
    stop_event.set()  # Signal audio thread just in case it's still running

    if stream:
        if stream.is_active():
            stream.stop_stream()
        stream.close()
        stream = None
    if audio:
        audio.terminate()
        audio = None
    # Try to join the audio thread to ensure clean exit
    if audio_thread and audio_thread.is_alive():
        audio_thread.join(timeout=1.0)


def save_wav_file():
    """Save recorded audio frames to a WAV file."""
    if not recorded_frames:
        print("No audio data recorded.")
        return

    # Generate filename with timestamp
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"recorded_audio_{timestamp}.wav"

    try:
        with wave.open(filename, 'wb') as wf:
            wf.setnchannels(CHANNELS)
            wf.setsampwidth(2)  # 16-bit = 2 bytes
            wf.setframerate(SAMPLE_RATE)

            # Write all recorded frames
            with recording_lock:
                wf.writeframes(b''.join(recorded_frames))

        print(f"Audio saved to: {filename}")
        print(f"Duration: {len(recorded_frames) * FRAMES_PER_BUFFER / SAMPLE_RATE:.2f} seconds")

    except Exception as e:
        print(f"Error saving WAV file: {e}")


# --- Main Execution ---
def run():
    global audio, stream, ws_app

    # Initialize PyAudio
    audio = pyaudio.PyAudio()

    # Open microphone stream
    try:
        stream = audio.open(
            input=True,
            frames_per_buffer=FRAMES_PER_BUFFER,
            channels=CHANNELS,
            format=FORMAT,
            rate=SAMPLE_RATE,
        )
        print("Microphone stream opened successfully.")
        print("Speak into your microphone. Press Ctrl+C to stop.")
        print("Audio will be saved to a WAV file when the session ends.")
    except Exception as e:
        print(f"Error opening microphone stream: {e}")
        if audio:
            audio.terminate()
        return  # Exit if microphone cannot be opened

    # Create WebSocketApp
    ws_app = websocket.WebSocketApp(
        API_ENDPOINT,
        header={"Authorization": YOUR_API_KEY},
        on_open=on_open,
        on_message=on_message,
        on_error=on_error,
        on_close=on_close,
    )

    # Run WebSocketApp in a separate thread to allow main thread to catch KeyboardInterrupt
    ws_thread = threading.Thread(target=ws_app.run_forever)
    ws_thread.daemon = True
    ws_thread.start()

    try:
        # Keep main thread alive until interrupted
        while ws_thread.is_alive():
            time.sleep(0.1)
    except KeyboardInterrupt:
        print("\nCtrl+C received. Stopping...")
        stop_event.set()  # Signal audio thread to stop

        # Send termination message to the server
        if ws_app and ws_app.sock and ws_app.sock.connected:
            try:
                terminate_message = {"type": "Terminate"}
                print(f"Sending termination message: {json.dumps(terminate_message)}")
                ws_app.send(json.dumps(terminate_message))
                # Give a moment for messages to process before forceful close
                time.sleep(5)
            except Exception as e:
                print(f"Error sending termination message: {e}")

        # Close the WebSocket connection (will trigger on_close)
        if ws_app:
            ws_app.close()

        # Wait for WebSocket thread to finish
        ws_thread.join(timeout=2.0)

    except Exception as e:
        print(f"\nAn unexpected error occurred: {e}")
        stop_event.set()
        if ws_app:
            ws_app.close()
        ws_thread.join(timeout=2.0)

    finally:
        # Final cleanup (already handled in on_close, but good as a fallback)
        if stream and stream.is_active():
            stream.stop_stream()
        if stream:
            stream.close()
        if audio:
            audio.terminate()
        print("Cleanup complete. Exiting.")


if __name__ == "__main__":
    run()

Core concepts

For a message-by-message breakdown of a turn, see our Streaming API: Message Sequence Breakdown guide.

Universal-Streaming is built on two core concepts: Turn objects and immutable transcriptions.

Turn object

A Turn object corresponds to a speaking turn in the context of voice agent applications, which is roughly an utterance in a broader context. We assign a unique ID to each Turn object and include it in the response. Specifically, the Universal-Streaming response is formatted as follows:

{
  "turn_order": 1,
  "end_of_turn": false,
  "transcript": "modern medicine is",
  "end_of_turn_confidence": 0.7,
  "words": [
    { "text": "modern", "word_is_final": true, ... },
    { "text": "medicine", "word_is_final": true, ... },
    { "text": "is", "word_is_final": true, ... },
    { "text": "amazing", "word_is_final": false, ... }
  ]
}

  • turn_order: Integer that increments with each new turn
  • turn_is_formatted: Boolean indicating if the text in the transcript field has been formatted with punctuation, casing, and inverse text normalization (e.g. dates, times, phone numbers). This field is false by default. Set format_turns=true to enable formatting. Use end_of_turn to detect end of turn, not turn_is_formatted.
  • end_of_turn: Boolean indicating if this is the end of the current turn
  • transcript: String containing only finalized words
  • end_of_turn_confidence: Floating-point number (0-1) representing the confidence that the current turn has finished, i.e., the current speaker has completed their turn
  • words: List of Word objects with individual metadata

Each Word object in the words array includes:

  • text: The string representation of the word
  • word_is_final: Boolean indicating if the word is finalized, where a finalized word means the word won’t be altered in future transcription responses
  • start: Timestamp for word start
  • end: Timestamp for word end
  • confidence: Confidence score for the word
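As a sketch of consuming these fields, the following parses a Turn payload shaped like the sample response above (timestamps omitted) and separates finalized words from the still-tentative last word; the variable names are illustrative, not part of the API:

```python
import json

# Hand-written Turn message matching the response format above
raw = json.dumps({
    "turn_order": 1,
    "end_of_turn": False,
    "transcript": "modern medicine is",
    "end_of_turn_confidence": 0.7,
    "words": [
        {"text": "modern", "word_is_final": True},
        {"text": "medicine", "word_is_final": True},
        {"text": "is", "word_is_final": True},
        {"text": "amazing", "word_is_final": False},
    ],
})

turn = json.loads(raw)

# `transcript` contains only finalized words, so joining the
# word_is_final entries reproduces it
final_words = [w["text"] for w in turn["words"] if w["word_is_final"]]
tentative = [w["text"] for w in turn["words"] if not w["word_is_final"]]

print(" ".join(final_words))  # "modern medicine is"
print(tentative)              # ["amazing"]
```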

Do not use turn_is_formatted to detect end of turn. Use end_of_turn to determine when a speaker’s turn has completed.

Immutable transcription

As AssemblyAI’s streaming system receives audio, it returns transcription responses in real time using the format specified above. Unlike many other streaming speech-to-text models, which use partial/variable transcriptions to show transcripts in an ongoing manner, Universal-Streaming transcriptions are immutable: text that has already been produced will not be overwritten in future transcription responses. With Universal-Streaming, transcriptions are therefore delivered in the following way:

→ Hello my na
→ Hello my name
→ Hello my name
→ Hello my name is
→ Hello my name is Zac
→ Hello my name is Zack

When an end of the current turn is detected, you receive a message with end_of_turn set to true. If you enable text formatting by setting format_turns=true, you will also receive a transcription response with turn_is_formatted set to true.

→ Hello my name is Zack
→ Hello, my name is Zack. (end_of_turn: true)

As the example above shows, the last word of a transcript may occasionally be a subword (“Zac”). Each Word object has the word_is_final field to indicate whether the model is confident that the last word is a complete word. Note that, except for the last word, word_is_final is always true.
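One way to apply this guidance is sketched below with hand-written, already-decoded messages rather than a live session. The handle_turn helper and the on_partial/on_final callbacks are hypothetical names introduced here for illustration; the sketch treats end_of_turn (not turn_is_formatted) as the turn boundary, per the note above:

```python
def handle_turn(msg, on_partial, on_final):
    """Route a Turn message; end_of_turn (not turn_is_formatted) ends a turn."""
    if msg.get("type") != "Turn":
        return
    if msg.get("end_of_turn"):
        on_final(msg.get("transcript", ""),
                 formatted=msg.get("turn_is_formatted", False))
    else:
        on_partial(msg.get("transcript", ""))


partials = []
finals = []

# Mid-turn update: immutable text so far, turn not yet complete
handle_turn(
    {"type": "Turn", "transcript": "Hello my name is Zack", "end_of_turn": False},
    partials.append,
    lambda t, formatted: finals.append((t, formatted)),
)
# Final, formatted transcript for the completed turn
handle_turn(
    {"type": "Turn", "transcript": "Hello, my name is Zack.",
     "end_of_turn": True, "turn_is_formatted": True},
    partials.append,
    lambda t, formatted: finals.append((t, formatted)),
)

print(partials)  # ['Hello my name is Zack']
print(finals)    # [('Hello, my name is Zack.', True)]
```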