Self-Hosted Streaming

The AssemblyAI Self-Hosted Streaming Solution provides secure, low-latency real-time transcription that can be deployed within your own infrastructure. This early access version is designed for design partners to evaluate and provide feedback on our self-hosted offering.

Getting the latest instructions

The most up-to-date deployment instructions, configuration files, and example scripts are maintained in our private GitHub repository:

https://github.com/AssemblyAI/streaming-self-hosting-stack

Design partners are encouraged to provide their GitHub username to gain access to the repository. Please contact the AssemblyAI team directly to request access.

Core principle

  • Complete data isolation: No audio data, transcript data, or personally identifiable information (PII) will ever be sent to AssemblyAI servers. Only usage metadata and licensing information are transmitted.

System requirements

Hardware requirements

  • GPU: NVIDIA GPU support required (any NVIDIA GPU model will work, T4 or newer recommended)

Software requirements

  • Operating System: Linux
  • Container Runtime: Docker and Docker Compose required
  • AWS Account: Required for pulling container images from our ECR registry

Architecture

The streaming solution consists of three AssemblyAI Docker images plus a standard nginx container:

  1. API Service (streaming-api) - Gateway API service handling WebSocket connections
  2. English ASR Service (streaming-asr-english) - English speech recognition model service
  3. Multilingual ASR Service (streaming-asr-multilang) - Multilingual speech recognition model service
  4. ASR Load Balancer (streaming-asr-lb) - Standard nginx:alpine container with header-based routing between ASR services

Connection flow

External Request → streaming-api:8080 (WebSocket) → streaming-asr-lb:80 → Header-based routing (X-Model-Version):
├── en-default → streaming-asr-english:50051 (gRPC)
└── ml-default → streaming-asr-multilang:50051 (gRPC)
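
For reference, here is a minimal sketch of the routing decision, assuming (as described in the language notes later in this guide) that the gateway derives X-Model-Version from the client's language query parameter; model_route is a hypothetical helper for illustration, not part of the stack:

# Illustrative only: mirrors the header-based routing above.
# model_route() is a hypothetical helper, not part of the AssemblyAI stack.
def model_route(language: str = "en") -> str:
    # "en" (or no language) -> English ASR; anything else -> multilingual ASR
    return "en-default" if language in ("", "en") else "ml-default"

assert model_route("en") == "en-default"
assert model_route("multi") == "ml-default"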

Prerequisites

  • Active enterprise contract with AssemblyAI
  • AWS account for container registry access
  • Linux environment with Docker and Docker Compose installed
  • NVIDIA Container Toolkit for GPU support

Setup and deployment

1. Docker runtime with GPU support

1.1 Verify NVIDIA drivers are installed:

$nvidia-smi

1.2 Install NVIDIA Container Toolkit:

Follow the NVIDIA Container Toolkit installation guide to set up GPU support for Docker.

1.3 Verify the Docker runtime has GPU access:

$docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

2. Obtain credentials

AWS ECR Access: We will manually provision AWS account credentials for your team to pull container images from our private Amazon ECR registry.

3. AWS ECR authentication

Authenticate with AWS ECR using provided credentials:

$aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 344839248844.dkr.ecr.us-west-2.amazonaws.com
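
If the login succeeds, you can optionally pre-pull one of the images to confirm registry access (image references are listed in the next step):

$docker pull 344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-api:release-v0.1.0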

4. Configure container images

Create a .env file with container image references:

STREAMING_API_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-api:release-v0.1.0
STREAMING_ASR_ENGLISH_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-asr-english:release-v0.1.0
STREAMING_ASR_MULTILANG_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-asr-multilang:release-v0.1.0
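
Once the docker-compose.yml from the Configuration section below is in place, you can confirm that Compose resolves these variables:

$# Verify the image references resolve from .env
>docker compose config | grep image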

5. Deploy with Docker Compose

Start all services:

$# Start all services
>docker compose up -d
>
># View logs
>docker compose logs -f
>
># Check service status
>docker compose ps

The ASR service containers ship with built-in model weights, so no separate model download is required.
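
Once the containers report healthy, you can spot-check the health endpoints from the host (ports as mapped in the Compose file below):

$# Gateway health (host port 8080)
>curl -f http://localhost:8080/v3/health
>
># ASR load balancer health (host port 8081)
>curl -fsS http://localhost:8081/nginx_health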

Configuration

Docker Compose configuration

The docker-compose.yml file defines the service architecture:

services:
  streaming-api:
    image: ${STREAMING_API_IMAGE}
    ports:
      - "8080:8080"
    environment:
      - AAI_WSS_PORT=8080
      - AAI_ASR_ENDPOINT=streaming-asr-lb:80
      - AAI_STREAMING_ASR_ENDPOINT=streaming-asr-lb:80
      - AAI_USE_SECURE_CHANNEL_TO_ASR_SERVICE=False
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/v3/health"]
      interval: 10s
      timeout: 2s
      retries: 2
      start_period: 5s
    depends_on:
      - streaming-asr-lb
    networks:
      - streaming-network

  streaming-asr-lb:
    image: nginx:alpine
    ports:
      - "8081:80"
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://localhost:80/nginx_health"]
      interval: 10s
      timeout: 2s
      retries: 2
      start_period: 10s
    volumes:
      - ./nginx_streaming_asr.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - streaming-asr-english
      - streaming-asr-multilang
    networks:
      - streaming-network

  streaming-asr-english:
    image: ${STREAMING_ASR_ENGLISH_IMAGE}
    ports:
      - "50051:50051"
    environment:
      - SERVER_PORT=50051
      - LOGGING_LEVEL=INFO
    healthcheck:
      test: ["CMD", "grpc_health_probe", "-addr=:50051"]
      interval: 10s
      timeout: 2s
      retries: 5
      start_period: 120s
    networks:
      - streaming-network
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: ["gpu"]

  streaming-asr-multilang:
    image: ${STREAMING_ASR_MULTILANG_IMAGE}
    ports:
      - "50052:50051"
    environment:
      - SERVER_PORT=50051
      - LOGGING_LEVEL=INFO
    healthcheck:
      test: ["CMD", "grpc_health_probe", "-addr=:50051"]
      interval: 10s
      timeout: 2s
      retries: 5
      start_period: 120s
    networks:
      - streaming-network
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: ["gpu"]

networks:
  streaming-network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/16

Nginx configuration

The ASR load balancer uses header-based routing to direct requests to the appropriate model service based on the X-Model-Version header:

nginx_streaming_asr.conf

events { worker_connections 1024; }

http {
    access_log /dev/stdout;
    error_log /dev/stderr info;

    upstream streaming_asr_english { server streaming-asr-english:50051; }
    upstream streaming_asr_multilang { server streaming-asr-multilang:50051; }

    map $http_x_model_version $asr_backend {
        default    streaming_asr_english;
        en-default streaming_asr_english;
        ml-default streaming_asr_multilang;
    }

    keepalive_timeout 10h;
    client_header_timeout 10h;
    send_timeout 10h;

    server {
        listen 80;
        http2 on;
        client_max_body_size 0;

        # Health endpoint (NGINX itself)
        location = /nginx_health {
            access_log off;
            default_type text/plain;
            return 200 "OK\n";
        }

        location / {
            grpc_pass grpc://$asr_backend;
            grpc_connect_timeout 75s;
            grpc_read_timeout 10h;
            grpc_send_timeout 10h;
            grpc_socket_keepalive on;
        }
    }
}

Service endpoints

  • WebSocket: ws://localhost:8080
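
As a quick connectivity check, here is a minimal sketch using the same websockets sync client as the example script below; it assumes the stack is running locally and prints the first server message (expected to be a "Begin" session event):

# Minimal WebSocket smoke test; assumes the stack from this guide is running locally.
from websockets.sync.client import connect

with connect(
    "ws://localhost:8080?sample_rate=16000",
    additional_headers={"Authorization": "self-hosted"},  # required for self-hosted
) as ws:
    print(ws.recv())  # first message should be a "Begin" session event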

Running the streaming example

A Python example script is provided to demonstrate how to stream a pre-recorded audio file to the self-hosted stack.

Note: You can initiate a session as soon as the streaming-asr-english and streaming-asr-multilang containers are healthy, which happens after they output a "Ready to serve!" log line.
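
One way to wait for that line (standard Docker Compose and grep usage; adjust the service name as needed):

$docker compose logs -f streaming-asr-english | grep -m1 "Ready to serve!"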

Setup

Change to the streaming_example directory:

$cd streaming_example

Create a fresh Python virtual environment and activate it:

$python -m venv streaming_venv
>source streaming_venv/bin/activate

Install the required packages:

$pip install -r requirements.txt

Python script

Save this as example_with_prerecorded_audio_file.py:

1"""
2Example script for streaming audio to AssemblyAI's self-hosted streaming transcription API.
3This is a minimal reference implementation for demonstration purposes only.
4For production use cases, best practices, and the complete API specification, please visit https://www.assemblyai.com/docs
5"""
6
7import argparse
8import json
9import logging
10import math
11import os
12import time
13import wave
14from concurrent.futures import ThreadPoolExecutor
15from dataclasses import dataclass
16from datetime import datetime, timedelta
17from typing import List, Optional
18from urllib.parse import urlencode
19
20from websockets.sync.client import ClientConnection, connect
21
22LOGGER = logging.getLogger(__name__)
23
24
25@dataclass(frozen=True)
26class AudioChunk:
27 data: bytes
28 duration_ms: int
29
30
31def _validate_and_get_pcm16_raw_bytes(
32 wav_file_path: str, expected_sample_rate: int
33) -> bytes:
34 """
35 Validate that the WAV file is PCM16 encoded with the expected sample rate and extract raw audio data.
36
37 :param wav_file_path: Path to the WAV file.
38 :param expected_sample_rate: Expected sample rate (e.g., 16000).
39 :return: Raw audio content as bytes.
40 :raises ValueError: If the file is not PCM16 or doesn't match expected sample rate.
41 """
42 with wave.open(wav_file_path, "rb") as wav_file:
43 # Check if it's PCM16
44 if wav_file.getsampwidth() != 2:
45 raise ValueError(
46 f"Audio file must be 16-bit PCM. Found sample width: {wav_file.getsampwidth() * 8}-bit"
47 )
48
49 if wav_file.getcomptype() != "NONE":
50 raise ValueError(
51 f"Audio file must be uncompressed PCM. Found compression type: {wav_file.getcomptype()}"
52 )
53
54 # Check sample rate
55 actual_sample_rate = wav_file.getframerate()
56 if actual_sample_rate != expected_sample_rate:
57 raise ValueError(
58 f"Audio file must have sample rate of {expected_sample_rate} Hz. "
59 f"Found: {actual_sample_rate} Hz"
60 )
61
62 # Check if mono
63 if wav_file.getnchannels() != 1:
64 raise ValueError(
65 f"Audio file must be mono (1 channel). Found: {wav_file.getnchannels()} channels"
66 )
67
68 raw_audio = wav_file.readframes(wav_file.getnframes())
69
70 return raw_audio
71
72
73def _get_chunks_from_file(
74 filepath: str,
75 sample_rate: int,
76 chunk_size_ms: int,
77) -> List[AudioChunk]:
78 """
79 Read a PCM16 WAV file and split it into chunks.
80
81 :param filepath: Path to the PCM16 WAV file.
82 :param sample_rate: Expected sample rate of the audio file.
83 :param chunk_size_ms: Duration of each chunk in milliseconds.
84 :return: List of AudioChunk objects.
85 :raises ValueError: If the file is not in the correct format.
86 """
87 chunks = []
88 audio_bytes: bytes = _validate_and_get_pcm16_raw_bytes(filepath, sample_rate)
89
90 read_bytes = 0
91 while read_bytes < len(audio_bytes):
92 frame_size = 2 # 16-bit PCM (2 bytes per sample)
93 chunk_bytes_len = int(sample_rate * chunk_size_ms * frame_size // 1000)
94 data = audio_bytes[read_bytes : read_bytes + chunk_bytes_len]
95 read_bytes += len(data)
96 actual_chunk_ms = math.ceil(len(data) * 1000 / (sample_rate * frame_size))
97 chunks.append(AudioChunk(data=data, duration_ms=actual_chunk_ms))
98
99 return chunks
100
101
102def _write_to_ws(ws: ClientConnection, audio_chunks: List[AudioChunk]) -> None:
103 """
104 Write audio chunks to the WebSocket connection.
105
106 :param ws: WebSocket connection.
107 :param audio_chunks: List of audio chunks to send.
108 """
109 try:
110 for chunk in audio_chunks:
111 # Sleep for the chunk duration to send chunks with realtime rate
112 time.sleep(chunk.duration_ms / 1000)
113 ws.send(chunk.data)
114 ws.send('{"type": "Terminate"}')
115 except Exception as e:
116 LOGGER.error(
117 f"Exception occurred while writing to websocket: {e}", exc_info=True
118 )
119 ws.close()
120 raise
121
122
123def _read_from_ws(ws: ClientConnection) -> None:
124 """
125 Read and process messages from the WebSocket connection.
126
127 :param ws: WebSocket connection.
128 """
129 try:
130 for message in ws:
131 data = json.loads(message)
132 if "type" not in data:
133 raise Exception(f"Unknown message received: {data}")
134 elif data["type"] == "Turn":
135 if data["words"]:
136 text = " ".join([word["text"] for word in data["words"]])
137 audio_start = data["words"][0]["start"]
138 audio_end = data["words"][-1]["end"]
139 end_of_turn = "True " if data["end_of_turn"] else "False"
140 LOGGER.info(
141 f"{timedelta(milliseconds=audio_start)}-"
142 f"{timedelta(milliseconds=audio_end)}, end-of-turn: {end_of_turn}: {text}",
143 )
144 elif data["type"] == "Begin":
145 expires_at = datetime.fromtimestamp(int(data["expires_at"]))
146 LOGGER.info(
147 f"Session started. Session id: {data['id']}, expires at: {expires_at}",
148 )
149 elif data["type"] == "Termination":
150 LOGGER.info(
151 f"Session completed with session duration: {data['session_duration_seconds']} sec.",
152 )
153 else:
154 LOGGER.error(f"Unknown message type: {data}")
155 except Exception as e:
156 LOGGER.error(
157 f"Exception occurred while reading from the websocket: {e}", exc_info=True
158 )
159 ws.close()
160 raise
161
162
163def run_session(
164 api_endpoint: str,
165 audio_chunks: List[AudioChunk],
166 sample_rate: int,
167 keyterms_prompt: Optional[List[str]] = None,
168 language: Optional[str] = None,
169) -> None:
170 """
171 Run a WebSocket session to stream audio and receive transcriptions.
172
173 :param api_endpoint: WebSocket endpoint URL.
174 :param audio_chunks: List of audio chunks to send.
175 :param sample_rate: Sample rate of the audio.
176 :param keyterms_prompt: Optional list of key terms for the transcription.
177 :param language: Optional language code for transcription.
178 """
179 try:
180 params = {
181 "sample_rate": sample_rate,
182 }
183 if keyterms_prompt:
184 params["keyterms"] = json.dumps(keyterms_prompt)
185 if language:
186 params["language"] = language
187
188 endpoint_str = f"{api_endpoint}?{urlencode(params)}"
189 headers = {"Authorization": "self-hosted"}
190 LOGGER.info(f"Endpoint: {endpoint_str}")
191 with ThreadPoolExecutor(max_workers=2) as executor:
192 with connect(endpoint_str, additional_headers=headers) as websocket:
193 write_future = executor.submit(
194 _write_to_ws,
195 websocket,
196 audio_chunks,
197 )
198 read_future = executor.submit(
199 _read_from_ws,
200 websocket,
201 )
202 write_future.result()
203 read_future.result()
204 except Exception as e:
205 LOGGER.error(
206 f"Exception occurred: {e}",
207 exc_info=True,
208 )
209 raise
210
211
212def parse_args():
213 """Parse command line arguments."""
214 parser = argparse.ArgumentParser(
215 description="Stream audio to AssemblyAI self-hosted real-time transcription service",
216 formatter_class=argparse.RawDescriptionHelpFormatter,
217 epilog="""
218Examples:
219 # Basic usage with default endpoint
220 python example_with_prerecorded_audio_file.py --audio-file example_audio_file.wav
221
222 # Specify custom endpoint and language
223 python example_with_prerecorded_audio_file.py --audio-file example_audio_file.wav --endpoint ws://localhost:8080 --language multi
224
225Note: Audio file must be PCM 16-bit WAV format, mono channel, 16kHz sample rate.
226 """,
227 )
228 parser.add_argument(
229 "--audio-file",
230 type=str,
231 default=os.path.dirname(__file__) + os.path.sep + "example_audio_file.wav",
232 help="Path to the audio file to transcribe (must be PCM 16-bit WAV, mono, 16kHz)",
233 )
234 parser.add_argument(
235 "--endpoint",
236 type=str,
237 default="ws://localhost:8080",
238 help="WebSocket endpoint URL (default: ws://localhost:8080)",
239 )
240 parser.add_argument(
241 "--language",
242 type=str,
243 default="",
244 help="Language code for transcription (e.g., 'multi')",
245 )
246 return parser.parse_args()
247
248
249if __name__ == "__main__":
250 try:
251 args = parse_args()
252 logging.basicConfig(level=logging.INFO, format="%(message)s")
253 sample_rate = 16_000
254
255 audio_chunks = _get_chunks_from_file(
256 args.audio_file,
257 sample_rate=sample_rate,
258 chunk_size_ms=100,
259 )
260 run_session(
261 api_endpoint=args.endpoint,
262 audio_chunks=audio_chunks,
263 sample_rate=sample_rate,
264 language=args.language if args.language else None,
265 )
266 except KeyboardInterrupt:
267 LOGGER.info("Interrupted by user, exiting.")
268 exit(0)
269 except ValueError as e:
270 LOGGER.error(f"Audio file validation error: {e}")
271 exit(1)

Usage

The example script (example_with_prerecorded_audio_file.py) requires a PCM 16-bit WAV file (mono channel, 16kHz sample rate).

Note on language parameter:

  • Use "en" or omit the --language parameter for English transcription (routes to English ASR service)
  • Use "multi" or any non-English language code for multilingual transcription (routes to multilingual ASR service)

Basic usage:

$python example_with_prerecorded_audio_file.py --audio-file example_audio_file.wav

Example with multilingual transcription:

$python example_with_prerecorded_audio_file.py \
> --audio-file example_audio_file.wav \
> --endpoint ws://localhost:8080 \
> --language multi

Command-line arguments:

| Argument | Description | Default |
| --- | --- | --- |
| --audio-file | Path to the audio file to transcribe (must be PCM 16-bit WAV, mono, 16kHz) | example_audio_file.wav |
| --endpoint | WebSocket endpoint URL | ws://localhost:8080 |
| --language | Language code for transcription. Use "en" (or omit) for English, "multi" for multilingual | "en" |

View help:

$python example_with_prerecorded_audio_file.py --help

Live microphone streaming example

This example demonstrates real-time microphone transcription using a remote self-hosted deployment. This is useful for testing your self-hosted instance from a local machine.

Setup

Install the required packages:

$pip install websockets pyaudio

Python script

Save this as live_microphone_streaming.py:

import asyncio
import json

import pyaudio
import websockets

# Replace with your server's IP address or use 'localhost' for local testing
SERVER_IP = "your.server.ip.address"

SAMPLE_RATE = 16000
FRAMES_PER_CHUNK = 1600  # 100ms of 16kHz 16-bit mono audio (1600 frames = 3200 bytes)


async def stream_audio(language="en"):
    # Build WebSocket URL with query parameters
    params = f"sample_rate={SAMPLE_RATE}&language={language}"
    ws_url = f"ws://{SERVER_IP}:8080/v3/ws?{params}"

    # Add authorization header (required for self-hosted)
    headers = {"Authorization": "self-hosted"}

    print(f"Connecting to {ws_url}...")

    # websockets >= 14 names this parameter `additional_headers`;
    # on older releases use `extra_headers` instead
    async with websockets.connect(ws_url, additional_headers=headers) as ws:
        print("Connected! Starting to stream audio...")

        # Set up audio stream from microphone
        p = pyaudio.PyAudio()
        stream = p.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=SAMPLE_RATE,
            input=True,
            frames_per_buffer=FRAMES_PER_CHUNK,
        )

        print(f"\n🎤 Listening with language={language}... speak into your microphone!")
        print("Press Ctrl+C to stop\n")

        # Coroutine to send audio
        async def send_audio():
            try:
                while True:
                    # Run the blocking read in a worker thread (Python 3.9+) so the
                    # event loop stays free to receive transcripts; each read returns
                    # ~100ms of audio, which paces sending at a realtime rate
                    data = await asyncio.to_thread(
                        stream.read, FRAMES_PER_CHUNK, exception_on_overflow=False
                    )
                    await ws.send(data)
            except (KeyboardInterrupt, asyncio.CancelledError):
                await ws.send(json.dumps({"type": "Terminate"}))
                print("\nStopping...")
            finally:
                stream.stop_stream()
                stream.close()
                p.terminate()

        # Coroutine to receive transcripts
        async def receive_transcripts():
            try:
                async for message in ws:
                    data = json.loads(message)

                    if data.get("type") == "Begin":
                        print(f"✅ Session started! ID: {data.get('id')}")

                    elif data.get("type") == "Turn":
                        if data.get("words"):
                            text = " ".join([word["text"] for word in data["words"]])
                            end_of_turn = "[FINAL]" if data.get("end_of_turn") else ""
                            print(f"📝 {text} {end_of_turn}")

                    elif data.get("type") == "Termination":
                        print(f"✅ Session completed. Duration: {data.get('session_duration_seconds')}s")
                        break

            except Exception as e:
                print(f"Error receiving: {e}")

        # Run both tasks concurrently
        await asyncio.gather(send_audio(), receive_transcripts())


if __name__ == "__main__":
    import sys

    # Usage: python live_microphone_streaming.py [language]
    # Examples:
    #   python live_microphone_streaming.py en     # English
    #   python live_microphone_streaming.py multi  # Multilingual with auto-detect
    #   python live_microphone_streaming.py es     # Spanish
    language = sys.argv[1] if len(sys.argv) > 1 else "en"

    try:
        asyncio.run(stream_audio(language))
    except KeyboardInterrupt:
        print("\nStopped by user")

Usage

Basic usage (English):

$python live_microphone_streaming.py

Multilingual transcription:

$python live_microphone_streaming.py multi

Specific language (e.g., Spanish):

$python live_microphone_streaming.py es

Note:

  • Make sure to replace SERVER_IP in the script with your actual server IP address
  • If testing locally on the same machine as the server, use localhost or 127.0.0.1
  • The Authorization: self-hosted header is required for all connections
  • Language routing: "en" routes to English ASR service, any other code (including "multi") routes to multilingual ASR service

Updating services

Model updates

To update to a new model version:

  1. Pull the new container images from ECR
  2. Update your .env file with the new image references
  3. Restart the services using Docker Compose
$docker compose down
>docker compose up -d
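
After updating the image references in .env, the new images can be fetched explicitly before restarting:

$# Fetch the images referenced by the updated .env
>docker compose pull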

Monitoring and debugging

View service logs

$# All services
>docker compose logs -f
>
># Specific service
>docker compose logs -f streaming-api

Check service status

$# Container status
>docker compose ps
>
># Resource usage
>docker stats
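
GPU utilization can be checked on the host with the same tool used during setup:

$# GPU utilization and memory
>nvidia-smi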

Troubleshooting

Debug commands

$# Check nginx configuration
>docker compose exec streaming-asr-lb nginx -t
>
># Restart specific service
>docker compose restart streaming-api
>docker compose restart streaming-asr-english
>docker compose restart streaming-asr-multilang

Common issues

  • GPU not detected: Verify NVIDIA Container Toolkit is properly installed and Docker has GPU access.

  • Services not starting: Check logs for specific error messages using docker compose logs -f [service-name].

  • Connection refused: Ensure all services are healthy by checking docker compose ps and reviewing health check status.

Current limitations

As a design partner, please be aware of these current limitations:

  • Text formatting is not included (coming in a future streaming model release)
  • Manual credential provisioning (no self-service dashboard yet)
  • Docker Compose deployment example only (production orchestration templates coming later)

Design partner support

What we provide

  • Docker Compose configuration file
  • Manual credential provisioning
  • Direct engineering support for deployment
  • Regular model updates

What we need from you

  • Feedback on deployment experience
  • Performance metrics in your environment
  • Feature requests and prioritization input
  • Use case validation

AWS deployment guide

This section provides step-by-step instructions for deploying the self-hosted streaming solution on AWS EC2, designed for users who may not be familiar with AWS infrastructure.

AWS prerequisites

Before you begin, ensure you have:

  • An AWS account with billing enabled
  • AWS CLI installed and configured on your local machine
  • Basic familiarity with SSH and command-line operations

EC2 instance setup

1. Request GPU quota increase

By default, AWS accounts have limited or zero quota for GPU instances. You’ll need to request an increase:

  1. Navigate to the AWS Service Quotas console
  2. Search for “EC2”
  3. Find “Running On-Demand G and VT instances” (for g4dn, g5, or similar GPU instances)
  4. Click “Request quota increase”
  5. Request at least 4 vCPUs (minimum for a g4dn.xlarge instance)
  6. Provide a use case description: “Self-hosted AI transcription service requiring GPU acceleration”
  7. Submit the request

Note: Quota requests typically take 24-48 hours to process. Plan accordingly.

2. Choose the right instance type

Recommended instance types based on your needs:

| Instance Type | vCPUs | GPU | Memory | Use Case | Approximate Cost/Hour |
| --- | --- | --- | --- | --- | --- |
| g4dn.xlarge | 4 | 1x T4 (16GB) | 16 GB | Development/Testing | ~$0.526 |
| g4dn.2xlarge | 8 | 1x T4 (16GB) | 32 GB | Light Production | ~$0.752 |
| g5.xlarge | 4 | 1x A10G (24GB) | 16 GB | Production (Higher Performance) | ~$1.006 |
| g5.2xlarge | 8 | 1x A10G (24GB) | 32 GB | Production (High Throughput) | ~$1.212 |

Recommendation: Start with g4dn.xlarge for evaluation, then scale to g4dn.2xlarge or g5 instances for production workloads.

3. Launch the instance

3.1 Navigate to the EC2 console and click “Launch Instance”

3.2 Configure instance settings:

  • Name: assemblyai-self-hosted-streaming
  • AMI: Search for and select “AWS Deep Learning AMI GPU PyTorch 2.0.1 (Ubuntu 20.04)”
    • AMI ID format: ami-xxxxxxxxx (varies by region)
    • This AMI includes pre-installed NVIDIA drivers, CUDA toolkit, and Docker with GPU support
  • Instance type: Select g4dn.xlarge (or your chosen instance type)
  • Key pair: Create a new key pair or select an existing one
    • If creating new: Download the .pem file and save it securely
    • Set permissions: chmod 400 your-key.pem

3.3 Configure storage:

  • Root volume: Increase to at least 100 GB gp3 (model weights and containers require significant space)
  • The default 8 GB is insufficient

3.4 Configure security group (Network settings):

Create a new security group with the following inbound rules:

| Type | Protocol | Port Range | Source | Description |
| --- | --- | --- | --- | --- |
| SSH | TCP | 22 | Your IP or 0.0.0.0/0 | SSH access for management |
| Custom TCP | TCP | 8080 | Your IP or 0.0.0.0/0 | WebSocket endpoint |
| Custom TCP | TCP | 8081 | Your IP or 0.0.0.0/0 | Health check endpoint (optional) |
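
For reference, an illustrative CLI equivalent of these rules (the security group ID and CIDR below are placeholders; substitute your own):

$# sg-0123456789abcdef0 and 203.0.113.0/24 are placeholders
>aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
>  --protocol tcp --port 22 --cidr 203.0.113.0/24
>aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
>  --protocol tcp --port 8080 --cidr 203.0.113.0/24
>aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
>  --protocol tcp --port 8081 --cidr 203.0.113.0/24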

Security recommendations:

  • For production: Restrict Source to your specific IP addresses or VPC CIDR ranges
  • For development/testing: You can use 0.0.0.0/0 but understand this allows public access
  • Consider using AWS VPN or Direct Connect for enhanced security
  • Enable AWS CloudTrail for audit logging

3.5 Launch the instance and wait for it to reach “Running” state

4. Connect to your EC2 instance

$# Replace with your instance's public IP and key file
>ssh -i your-key.pem ubuntu@<EC2_PUBLIC_IP>

5. Verify GPU and Docker setup

Once connected, verify the pre-installed components:

$# Verify NVIDIA drivers
>nvidia-smi
>
># Verify Docker
>docker --version
>
># Verify Docker Compose (v2 syntax)
>docker compose version
>
># If the above fails, you may need to install Docker Compose v2
># Remove old version if present
>sudo apt-get remove docker-compose
>
># Install Docker Compose v2 (plugin)
>sudo apt-get update
>sudo apt-get install docker-compose-plugin
>
># Verify installation
>docker compose version
>
># Verify GPU access in Docker
>docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

Important: This setup uses Docker Compose v2, which uses the command docker compose (space, no hyphen) instead of the older docker-compose (hyphen). All commands in this guide use the v2 syntax.

6. Configure AWS credentials on the instance

Set up AWS credentials to pull container images from ECR:

$# Install AWS CLI if not already installed
>sudo apt-get update
>sudo apt-get install -y awscli
>
># Configure AWS credentials (use the credentials provided by AssemblyAI)
>aws configure

You’ll be prompted to enter:

  • AWS Access Key ID
  • AWS Secret Access Key
  • Default region: us-west-2
  • Default output format: json
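
Alternatively, the standard AWS environment variables can be used instead of stored credentials (the values are the ones provided by AssemblyAI):

$export AWS_ACCESS_KEY_ID=<provided-access-key-id>
>export AWS_SECRET_ACCESS_KEY=<provided-secret-access-key>
>export AWS_DEFAULT_REGION=us-west-2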

7. Deploy the self-hosted streaming solution

Follow the standard deployment instructions from the “Setup and deployment” section above:

$# Authenticate with ECR
>aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 344839248844.dkr.ecr.us-west-2.amazonaws.com
>
># Create project directory
>mkdir -p ~/assemblyai-streaming
>cd ~/assemblyai-streaming
>
># Create .env file with image references
>cat > .env << 'EOF'
>STREAMING_API_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-api:release-v0.1.0
>STREAMING_ASR_ENGLISH_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-asr-english:release-v0.1.0
>STREAMING_ASR_MULTILANG_IMAGE=344839248844.dkr.ecr.us-west-2.amazonaws.com/self-hosted-streaming-asr-multilang:release-v0.1.0
>EOF
>
># Create docker-compose.yml file
># Copy the complete docker-compose.yml content from the Configuration section above and save it
># Or download it from the GitHub repository
>
># Create nginx configuration file
># Copy the nginx_streaming_asr.conf content from the Configuration section above and save it
># Or download it from the GitHub repository
>
># Start services
>docker compose up -d
>
># Monitor logs (services may take 2-3 minutes to fully start)
>docker compose logs -f

Important startup notes:

  • The ASR services (streaming-asr-english and streaming-asr-multilang) take approximately 2-3 minutes to fully initialize
  • You’ll see "Ready to serve!" in the logs when each ASR service is ready
  • Health checks may show “unhealthy” during startup - this is normal
  • Wait until both ASR services show "Ready to serve!" before attempting to use the API

8. Test the deployment

From your local machine, test the connection using the live microphone example (see the Live microphone streaming example section above).

Important: Replace SERVER_IP in the example script with your EC2 instance’s public IP address, which you can find in the EC2 console under your instance details.

AWS cost optimization tips

  • Use Spot Instances: Save up to 70% for non-critical workloads (may be interrupted)
  • Stop instances when not in use: GPU instances are expensive; stop them during off-hours (see the CLI example after this list)
  • Use CloudWatch alarms: Set up billing alerts to avoid unexpected costs
  • Consider Reserved Instances: Save up to 60% with 1 or 3-year commitments for production workloads
  • Right-size your instance: Monitor GPU utilization and downgrade if consistently underutilized
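
For example, stopping an idle instance from the CLI (the instance ID is a placeholder):

$# Stop a GPU instance when idle; start it again when needed
>aws ec2 stop-instances --instance-ids i-0123456789abcdef0
>aws ec2 start-instances --instance-ids i-0123456789abcdef0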

Security best practices

  1. Enable AWS Systems Manager Session Manager for SSH-less access
  2. Use IAM roles instead of hardcoded credentials where possible
  3. Enable VPC Flow Logs for network monitoring
  4. Regular security updates: sudo apt update && sudo apt upgrade -y
  5. Use AWS Secrets Manager to store sensitive configuration
  6. Enable EBS encryption for data at rest
  7. Configure CloudWatch Logs for centralized logging
  8. Implement least privilege access with security groups and NACLs

Troubleshooting AWS-specific issues

Issue: “InsufficientInstanceCapacity” error when launching

  • Solution: Try a different availability zone within your region or a different instance type

Issue: Quota request denied or pending

  • Solution: Contact AWS Support through the console with your use case details

Issue: Cannot connect to EC2 instance

  • Solution: Verify security group allows SSH (port 22) from your IP
  • Solution: Check that you’re using the correct key pair and username (ubuntu for Ubuntu AMIs)

Issue: Docker containers fail to start with GPU errors

  • Solution: Verify NVIDIA Container Toolkit is properly configured
  • Solution: Check that the instance type has GPU resources

Issue: Services show “unhealthy” status

  • Solution: ASR services take 2-3 minutes to fully initialize - wait for “Ready to serve!” log messages
  • Solution: Health checks may fail during startup - this is normal and will resolve once services are ready

Issue: Connection refused when testing from local machine

  • Solution: Ensure you’re using the instance’s public IP address, not the private IP
  • Solution: Verify security group allows inbound traffic on port 8080 from your IP
  • Solution: Check that services are fully started with docker compose logs -f

Issue: “Authorization” header missing error

  • Solution: All WebSocket connections must include the header Authorization: self-hosted

Issue: Need to transfer files to EC2 instance (e.g., audio files)

  • Solution: Use SCP from your local machine:
    $scp -i your-key.pem local-file.wav ubuntu@<EC2_PUBLIC_IP>:~/destination/

Issue: High costs

  • Solution: Stop the instance when not in use
  • Solution: Review CloudWatch metrics to ensure you’re using the right instance size