Self-Hosted Streaming
The AssemblyAI Self-Hosted Streaming Solution provides a secure, low-latency real-time transcription service that can be deployed within your own infrastructure. Audio, transcripts, and PII never leave your network — only license validation and usage metadata are transmitted back to AssemblyAI.
Self-hosted streaming requires an upfront commercial commitment of $20,000. Contact our sales team to discuss your needs and learn more about our self-hosted offering.
The deployment instructions, Compose files, nginx configuration, and example clients are maintained in the public streaming-self-hosting-stack repository. This page covers what self-hosted streaming is, what you need to run it, and how the stack is shaped. Go to the repo for the actual setup steps.
What you can self-host
Self-hosted streaming ships as two separate stacks. Each stack serves one model family, runs from its own Docker Compose file, and uses its own GPU. You pick the stack that matches the model you want to serve — they are not designed to run side by side.
For product capabilities and accuracy details, see the Universal Streaming and Universal-3 Pro Streaming overview pages.
Core principle
- Complete data isolation. Audio, transcripts, and PII stay inside your infrastructure. The only outbound traffic is license validation and (for usage-based contracts) usage metadata to `https://usage-tracker.assemblyai.com`.
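Since license validation is the one piece of required egress, it can be worth confirming reachability from the deployment host before starting the stack. A minimal sketch using `curl` purely as a TCP/TLS reachability probe (any HTTP status code is fine for this purpose):

```shell
# Egress check for the single outbound host the stack needs.
# Any HTTP response means the network path is open; a timeout or
# connection error suggests a firewall rule is blocking it.
curl -sS -o /dev/null --max-time 10 https://usage-tracker.assemblyai.com \
  && echo "egress ok" \
  || echo "egress blocked (license validation will fail)"
```

If this reports blocked, add an egress allowlist entry for `usage-tracker.assemblyai.com` before deploying.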
System requirements
Hardware
- Universal Streaming. NVIDIA T4 or newer per ASR container. We recommend at least 4 CPU and 16 GB RAM per ASR container.
- Universal-3 Pro Streaming. NVIDIA L4, A10, A100, L40S, H100, or equivalent with at least 24 GB VRAM. The container also bundles ~14 GB of model weights, so plan disk accordingly. T4 GPUs are not sufficient for U3 Pro.
Software
- Operating system. Linux
- Container runtime. Docker and Docker Compose (v2, the `docker compose` command, not `docker-compose`)
- NVIDIA Container Toolkit. Required for Docker to access the GPU
- AWS credentials. AssemblyAI provisions a scoped AWS access key for your team so you can pull container images from our private ECR registry
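The software requirements above can be sanity-checked with a small script. This is a local inspection sketch (it modifies nothing); the `check` helper is illustrative, not part of the stack:

```shell
# Verify host prerequisites before pulling images.
check() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

check docker        # container runtime
check nvidia-smi    # NVIDIA driver, required by the Container Toolkit
check aws           # AWS CLI, used to authenticate to ECR

# Compose v2 is a Docker CLI plugin, not a separate binary:
docker compose version >/dev/null 2>&1 \
  && echo "ok: docker compose (v2)" \
  || echo "missing: docker compose (v2)"
```

To confirm the GPU is actually visible to containers, the usual check is running `nvidia-smi` inside a CUDA base image with `docker run --rm --gpus all`.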
Architecture
Both stacks share the same gateway, load balancer, and license proxy — they only differ in the ASR backend.
Shared services (both stacks)
- `streaming-api` — Gateway WebSocket service that clients connect to. Handles session lifecycle, audio framing, and routing to the ASR backend.
- `license-and-usage-proxy` — Validates the license file at startup and reports usage metadata (for usage-based contracts).
- `streaming-asr-lb` — `nginx:alpine` load balancer that routes ASR gRPC requests to the right backend based on the `X-Model-Version` header.
Universal Streaming stack
Adds two ASR backends:
- `streaming-asr-english` — English speech recognition.
- `streaming-asr-multilang` — Multilingual speech recognition.
Universal-3 Pro Streaming stack
Adds a single ASR backend:
- `streaming-asr-u3pro` — Universal-3 Pro speech recognition. Available as of v0.6.0.
Connection flow
The load balancer only forwards to backends that are actually deployed in the running stack: the Universal Streaming stack serves the `en-default` and `ml-default` routes, and the U3 Pro stack serves `u3-pro`.
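Header-based gRPC routing in nginx is commonly expressed with a `map` on the request header. The fragment below is a hypothetical sketch of that pattern, not the shipped `streaming-asr-lb` configuration; the upstream ports and the listen port are assumptions, while the service names and route values come from this page. In a real deployment only the backends present in the chosen stack would appear:

```nginx
# Hypothetical sketch of X-Model-Version routing (ports are placeholders).
map $http_x_model_version $asr_backend {
    "en-default"  streaming-asr-english:50051;
    "ml-default"  streaming-asr-multilang:50051;
    "u3-pro"      streaming-asr-u3pro:50051;
    default       streaming-asr-english:50051;
}

server {
    listen 50050 http2;          # gRPC requires HTTP/2
    location / {
        grpc_pass grpc://$asr_backend;
    }
}
```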
Getting started
Follow the upstream repo’s README for the actual setup steps. At a high level:
1. Get credentials and a license file from your AssemblyAI representative: an AWS access key scoped to ECR, and a `license.jwt` file. The same license file works for both stacks.
2. Install Docker, Docker Compose, and the NVIDIA Container Toolkit. See the README's setup section for verification commands.
3. Authenticate to ECR with the provided AWS credentials.
4. Pick a stack and configure `.env` with the image references from the repo's `.env.example`.
5. Start the stack with `docker compose up -d` (Universal Streaming) or `docker compose -f docker-compose.u3pro.yml up -d` (Universal-3 Pro Streaming).
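The steps above can be sketched as a single script. The registry host and region here are placeholders, not real values; substitute the ones from your AssemblyAI onboarding, and use the image references from the repo's `.env.example`:

```shell
# Hypothetical quick-start; ECR_REGISTRY and AWS_REGION are placeholders
# provided during onboarding, not real values.
ECR_REGISTRY="123456789012.dkr.ecr.us-east-1.amazonaws.com"  # placeholder
AWS_REGION="us-east-1"                                        # placeholder

deploy_universal_streaming() {
  # 1. Authenticate Docker to the private ECR registry
  aws ecr get-login-password --region "$AWS_REGION" \
    | docker login --username AWS --password-stdin "$ECR_REGISTRY"

  # 2. Configure the stack from the repo's template, then edit as needed
  cp .env.example .env

  # 3. Start the Universal Streaming stack in the background
  docker compose up -d
}
```

For the Universal-3 Pro stack, the final step would instead use `docker compose -f docker-compose.u3pro.yml up -d`.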
Universal Streaming ASR containers take roughly 2 minutes to become ready and log `Ready to serve!`. The Universal-3 Pro Streaming ASR container takes roughly 5 minutes and logs `U3Pro ASR Server ready!`. Health checks may report unhealthy during startup — that is expected.
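Because readiness takes minutes and health checks flap during startup, scripted deployments often poll the logs for the readiness line instead. A sketch of such a poller (the helper is illustrative; the service names and log strings are the ones from this page):

```shell
# Poll a service's logs until the readiness line appears, or time out.
wait_ready() {
  service="$1"; pattern="$2"; tries="${3:-60}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if docker compose logs "$service" 2>/dev/null | grep -q "$pattern"; then
      echo "ready: $service"
      return 0
    fi
    i=$((i + 1))
    sleep 5
  done
  echo "timed out waiting for $service" >&2
  return 1
}

# Usage, depending on which stack is running:
# wait_ready streaming-asr-english "Ready to serve!"
# wait_ready streaming-asr-u3pro "U3Pro ASR Server ready!"
```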
Running a test client
The repo ships an example Python client under `streaming_example` that streams a pre-recorded WAV file to the WebSocket endpoint. It supports all three speech models via the `--speech-model` flag:
- `universal-streaming-english` — Universal Streaming, English
- `universal-streaming-multilingual` — Universal Streaming, multilingual
- `u3-rt-pro` — Universal-3 Pro Streaming
The client routes to the correct ASR backend automatically via the `X-Model-Version` header. Make sure the value you pass matches a backend deployed in the stack you started.
Switching between stacks
The two stacks listen on the same ports (`streaming-api` on 8080, and the ASR load balancer on its gRPC port), so they cannot run simultaneously. To switch, stop the running stack with its Compose file, then start the other.
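A sketch of the switch in both directions, using the Compose commands from this page (the helper function names are illustrative):

```shell
# Switch from Universal Streaming to Universal-3 Pro Streaming.
switch_to_u3pro() {
  docker compose down                               # stop Universal Streaming
  docker compose -f docker-compose.u3pro.yml up -d  # start Universal-3 Pro
}

# Switch back from Universal-3 Pro to Universal Streaming.
switch_to_universal() {
  docker compose -f docker-compose.u3pro.yml down   # stop Universal-3 Pro
  docker compose up -d                              # start Universal Streaming
}
```

After switching, remember that the new stack's ASR containers need their usual startup time before they accept sessions.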
Production deployment
Per-service deployment strategy, resource sizing, autoscaling thresholds, health-check tuning, and the `license-and-usage-proxy` `/v1/status` endpoint reference all live in the repo's Production Deployment Recommendations section.
Release notes and changelog
Release notes for the self-hosted stack — including per-version model improvements, API additions, and breaking changes — live in the repo’s README changelog. Tagged releases are visible on the Releases page.
Support
For deployment questions, image access, license issues, or to report bugs, contact your AssemblyAI representative.