Self-hosted deployment

Voice AI models hosted in your environment

Run AssemblyAI's Voice AI models on your own infrastructure to reduce latency, meet compliance requirements, and keep full control of your stack.

Our Voice AI models, running on your infrastructure

Self-host our speech-to-text models with the same accuracy and price-performance you get from AssemblyAI's cloud API.

Co-locate your Voice AI stack with the rest of your infrastructure so audio is processed close to where your traffic originates.

Keep every second of audio inside your environment, even while you're serving customers globally.

Tune scaling to match your exact traffic patterns. We provide the metrics and observability your autoscaling needs.

Run on any container orchestration platform — Kubernetes, AWS ECS, or whatever your team already uses.

Apply AssemblyAI usage to your cloud provider's committed spend so you get the discounts you've already negotiated.

Meet strict regulatory and data residency requirements by processing audio inside your controlled perimeter.

Run our Universal-3.5 Pro Realtime model and the Sync API with the same accuracy and speed you get from our cloud API.

Same usage-based pricing as the cloud — no self-hosting premium. Daily billing options and volume discounts included.

Full GPU support for maximum performance, with options for regions with hardware import limitations.

Enterprise cloud savings

AssemblyAI agreements can run through AWS Marketplace, so your usage counts toward your committed spend and helps you maximize cloud discounts.

Self-hosting carries no extra fees. It uses the same usage-based pricing as our cloud service: you pay only for active sessions, and you're eligible for volume-based discounts.
Self-hosting can shave 50–200 ms off latency in regions far from our cloud endpoints, like Australia, Singapore, or South America. If you already run close to our cloud, our standard API with Global Edge Routing typically delivers comparable performance.
Our containerized deployment supports a range of GPU configurations. Each instance handles up to 48 concurrent streams without runtime degradation.
We provide a Docker Compose demo for initial setup. For production deployments on Kubernetes, AWS ECS, or other orchestration platforms, our Applied AI Engineers work directly with your team.
Self-hosting works in any region, including government clouds like GovCloud and countries with data sovereignty or hardware import restrictions.
Both the Realtime Speech-to-Text API (Universal-3.5 Pro Realtime) and the Sync Speech-to-Text API (powered by Universal-3.5 Pro) are available for self-hosted deployment. That covers live streaming sessions and single-call transcription on short clips, all running entirely inside your environment.
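The limit of 48 concurrent streams per instance mentioned above gives a simple starting point for capacity planning. The sketch below is illustrative only: the function name and the `target_utilization` headroom parameter are assumptions made for this example, not part of any AssemblyAI tooling, and real sizing depends on your GPU configuration and traffic profile.

```python
import math

# Per-instance ceiling taken from the text above.
MAX_STREAMS_PER_INSTANCE = 48


def instances_needed(peak_streams: int, target_utilization: float = 0.75) -> int:
    """Estimate how many instances cover a peak number of concurrent streams.

    target_utilization is a hypothetical headroom knob: 0.75 means each
    instance is planned to carry at most 75% of its 48-stream ceiling,
    leaving room for bursts while an autoscaler adds capacity.
    """
    if peak_streams < 0:
        raise ValueError("peak_streams must be non-negative")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")

    # Streams each instance is planned to carry, never less than one.
    usable = max(math.floor(MAX_STREAMS_PER_INSTANCE * target_utilization), 1)
    return math.ceil(peak_streams / usable)


if __name__ == "__main__":
    for peak in (48, 500, 2000):
        print(peak, "peak streams ->", instances_needed(peak), "instances")
```

Setting `target_utilization=1.0` sizes the fleet at the hard limit with no headroom, which is only reasonable when traffic is flat and predictable.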