The production ceiling: where voice agent stacks start showing their limits
The three production ceilings voice agent builders hit after shipping, from accents to compliance to noisy environments, and how to break through each one.

You shipped something. It works. And you're quietly starting to wonder if it should work better.
There's a specific moment most voice agent builders recognize in retrospect. The product is live. Users are calling in. The core loop—speech to text to LLM to speech—is holding together. Your vendor is fine. Not amazing, but fine. You stop thinking about the infrastructure and start thinking about features.
Then a customer in Manchester starts complaining that the agent mishears them half the time. Or a promising enterprise deal stalls when a security reviewer asks whether the speech model can be deployed behind their firewall. Or you bring the product to a demo at a loud conference, and suddenly you're watching your agent fall apart in exactly the conditions your users are actually in.
This is the production ceiling. It's not a feature you skipped or a bug you missed. It's the point where the choices you made when you were just trying to ship something start constraining what your product can become.
This post is for builders who are already in production—not choosing a stack from scratch, but asking harder questions about the one they have. Specifically: the three ceilings we see voice agent developers hit most often, and what it actually looks like to break through each one.
Ceiling 1: your model works in English. Does it work in production?
When developers evaluate speech-to-text vendors, they usually test with audio that resembles what they would record themselves: clean signal, minimal background noise, standard American English. Deepgram performs well in those conditions. Most vendors do.
The gap surfaces later.
A UK-based enterprise customer hears their own name mangled repeatedly by your support agent. A Canadian French speaker keeps getting routed to a fallback because the transcription accuracy drops below the threshold your LLM needs to reason reliably. A healthcare provider's billing codes—alphanumeric strings your model has never encountered in its training distribution—come through as gibberish. None of these showed up in your benchmark. All of them showed up in your support queue.
This isn't a subtle problem. It's a measurable one. The entity miss rate—how often a model fails to correctly transcribe proper nouns, alphanumeric strings, domain terminology, and accented speech—is one of the starkest differentiators between vendors once you leave the controlled test environment. Closely related to word error rate, entity accuracy is what actually determines whether your agent can complete the task. In independent evaluation, AssemblyAI's Universal-3 Pro Streaming model misses 16.7% of entities in mixed-input conditions. Deepgram Nova-3 misses 25.5%. On the inputs that actually break voice agents in production, that gap isn't incremental—it's the difference between a product that works for your whole user base and one that works for the segment that sounds like your test data.
What's telling is how vendors are responding to the gap. The pattern we've heard from builders is that Deepgram's long-tail language and dialect coverage lags far enough behind that Deepgram has been resourcing internal human labeling efforts to shore it up. That's not a sign of a solved problem; it's a sign of a vendor catching up. If you're building for a global user base or a specialized domain, you can't afford to wait for that process to complete.
The right way to know if this affects you is to test your actual production audio—not a curated sample, not a clean recording, but the real distribution of what your users sound like. A structured evaluation should cover at minimum: proper nouns and named entities specific to your domain, alphanumeric strings (account numbers, product codes, medication names), non-native speakers or regional dialect variants in your user base, and any specialized vocabulary your LLM relies on to reason correctly.
If you haven't run that test, you don't yet know which ceiling you're approaching.
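The structured evaluation described above can be scored with very little code. What follows is a minimal sketch, not a vendor tool: it assumes you have hand-labeled the entities each call must contain, and the names `Sample`, `normalize`, and `entity_miss_rate` are hypothetical. Matching is a simple normalized substring check, which forgives spacing and punctuation differences ("RX-7704132" vs. "rx 7704132") but can over-match very short entities, so treat the number as a rough signal.

```python
import re
from dataclasses import dataclass


def normalize(text: str) -> str:
    """Lowercase and keep only letters and digits so formatting differences don't count as misses."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


@dataclass
class Sample:
    reference_entities: list[str]  # entities a human reviewer marked as required
    hypothesis: str                # transcript returned by the STT model under test


def entity_miss_rate(samples: list[Sample]) -> float:
    """Fraction of labeled entities that do not appear in the model's transcript."""
    total = 0
    missed = 0
    for sample in samples:
        squashed = normalize(sample.hypothesis)
        for entity in sample.reference_entities:
            total += 1
            if normalize(entity) not in squashed:
                missed += 1
    if total == 0:
        raise ValueError("no reference entities to score")
    return missed / total


if __name__ == "__main__":
    samples = [
        Sample(["RX-7704132", "Manchester"],
               "change my subscription on account rx 7704132 in manchester"),
        Sample(["Metoprolol"], "i take my toe pro lol daily"),
    ]
    print(f"entity miss rate: {entity_miss_rate(samples):.1%}")
```

Running the same labeled set through each vendor gives you a miss rate on your own audio rather than on a published benchmark.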
Ceiling 2: enterprise wants on-prem. Your vendor can't go there.
There's a recurring pattern in enterprise voice agent sales that's worth naming explicitly.
The deal looks good. The technical evaluation went well. The champion is bought in. Legal review begins. And then someone—usually in security, sometimes in compliance, occasionally in IT infrastructure—asks a version of the same question: "Where is the speech data being processed, and can we control that?"
If the answer is "our cloud, and no," the deal is over. Not paused, over. Data residency requirements, regulated industry compliance (HIPAA, SOC 2, FedRAMP), air-gapped network mandates, and enterprise-grade security reviews all point to the same requirement: the vendor needs to be able to run inside the customer's environment, not just gesture toward a compliance page.
Deepgram does not have a credible self-hosted deployment path for most enterprise buyers. This isn't a positioning gap that marketing can bridge. It's a hard capability gap that removes them from consideration in a meaningful slice of the enterprise market. Healthcare systems, financial services firms, government agencies, and defense-adjacent organizations don't negotiate on this requirement. They simply move to vendors who can meet it.
AssemblyAI's self-hosted offering is production-ready. Universal-3 Pro and the full Voice Agent API stack can be deployed in your cloud environment, on-premises, or in air-gapped infrastructure, with full API parity with the hosted version. You get the same model quality, the same latency profile, and the same context injection capabilities, without data leaving your environment.
This isn't a niche feature. For any team selling into regulated industries or large enterprise, it's the question that determines whether you're in the deal. And it's a capability that, until recently, many voice agent builders didn't know was on the table. That's a solvable problem, but only if you ask.
The practical implication: if you're building a voice agent that will eventually be sold to enterprise customers, the deployment architecture question needs to be answered before you're in the room with a security reviewer, not during. "Can we deploy this in our environment?" should be a question you can answer yes to. Right now.
Ceiling 3: what your pipeline hears is a choice, not a given
This one is the most underestimated, and arguably the most immediately addressable.
Most voice agent builders treat transcription accuracy as a fixed property of the model they chose. Audio goes in. Transcript comes out. Whatever accuracy you get is the accuracy you get. The work of improving it feels like a model-selection problem—find a better vendor, swap it in.
It's not. Transcription accuracy in production is in large part a function of context, and most builders aren't using it.
Here's how the gap actually works: when a user calls in and says "I need to change my subscription on account RX-7704132," a base transcription model will often miss or mangle that account number. The model has never seen that string before, so it assigns almost no probability to that specific alphanumeric sequence following "account." The model guesses wrong. The LLM receives a broken transcript. The agent either hallucinates a resolution or fails to complete the task.
Now add context. Before the session begins, you inject the user's account number, their product category, the specific vocabulary domain relevant to this call. During the session, previous turns feed back into the transcription model—not just the LLM, but the STT layer itself. The model now has a prior for what it's likely to hear. Accuracy on the exact inputs that matter—proper nouns, domain terms, alphanumeric strings—improves materially. Not because the model is different, but because it has information it didn't have before. The same logic applies to turn detection and speaker diarization, both of which sharpen with richer context.
Universal-3 Pro Streaming supports native context chaining. Keyterms, domain vocabulary, and prior turn context can be injected dynamically into the streaming transcription pipeline. In controlled tests on domain-specific inputs, the improvement from keyterm injection alone moves entity accuracy from the 70–75% range to above 90%. That's not a better model. It's the same model, used correctly.
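As a rough illustration of the pre-session step, here is a sketch of assembling a keyterm list from what you already know about the caller. Everything in it is an assumption made for illustration: `CallContext`, `build_keyterms`, the `max_terms` cap, and the `"keyterms"` config key are hypothetical names, not the vendor's API. Check the streaming documentation for the actual parameter name and limits before wiring this into a session.

```python
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class CallContext:
    """Hypothetical per-call context pulled from your own CRM or session store."""
    account_number: str
    product_names: list[str] = field(default_factory=list)
    domain_terms: list[str] = field(default_factory=list)


def build_keyterms(ctx: CallContext,
                   prior_turn_entities: Iterable[str] = (),
                   max_terms: int = 100) -> list[str]:
    """Merge known context into one ordered, de-duplicated keyterm list.

    Order reflects priority: the account number first, then entities heard
    in earlier turns, then product names, then general domain vocabulary.
    max_terms is an assumed cap, not a documented limit.
    """
    candidates = [ctx.account_number, *prior_turn_entities,
                  *ctx.product_names, *ctx.domain_terms]
    seen: set[str] = set()
    keyterms: list[str] = []
    for term in candidates:
        cleaned = term.strip()
        key = cleaned.casefold()  # case-insensitive de-duplication
        if not cleaned or key in seen:
            continue
        seen.add(key)
        keyterms.append(cleaned)
    return keyterms[:max_terms]


if __name__ == "__main__":
    ctx = CallContext("RX-7704132", ["Premium Plan"], ["proration", "premium plan"])
    # "keyterms" is a placeholder key; use whatever field your STT vendor documents.
    session_config = {"keyterms": build_keyterms(ctx, prior_turn_entities=["Manchester"])}
    print(session_config)
```

Rebuilding the list after each turn, with newly confirmed entities passed as `prior_turn_entities`, is one simple way to approximate the chaining behavior described above.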
Noise cancellation compounds this further. In real-world audio environments (a call center floor, a hospital ward, a conference venue, a user calling from a moving car), the signal your model receives is degraded before context even comes into play. Audio isolation preprocessing cleans the signal before it reaches the STT layer, so the model gets cleaner input and better context at the same time. The two gains compound rather than simply adding up. Speech understanding layers on top of this to extract intent, entities, and sentiment downstream.
We saw this play out in real conditions at the SF Voice Agents API hackathon this year. Developers were building and testing voice agents in a loud, crowded venue—exactly the conditions that would break a naive voice pipeline. What we heard afterward from attendees wasn't frustration with audio quality—it was genuine surprise that it worked as well as it did. One striking observation from a technical leader at a major voice AI company: "A hackathon like this wouldn't have been possible a year ago."
That's not a product claim. It's an observation from the conditions that actually matter.
The practical path here is a before/after evaluation using your own production audio. Pull a representative sample—ideally including your worst-performing calls, the transcriptions that produce bad downstream LLM behavior. Run them through the base model. Then run them again with your domain vocabulary injected as keyterms and conversation history chained through. Measure the entity accuracy delta. In our experience with teams running this test, the result is usually surprising enough that the follow-up question is "why weren't we doing this already?" For a broader primer on the architecture, see our voice agents guide.
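One way to structure that before/after comparison is sketched below. It assumes you already have both transcripts for each call (base run and context-injected run) plus a hand-labeled entity list; `Call`, `Comparison`, and `compare_runs` are hypothetical names. Besides the overall delta, it lists which entities context recovered and which it lost, since a regression on a critical entity matters more than the average.

```python
import re
from dataclasses import dataclass, field


def _squash(text: str) -> str:
    """Lowercase and drop non-alphanumerics so 'RX-7704132' matches 'rx 7704132'."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


@dataclass
class Call:
    entities: list[str]       # hand-labeled entities the transcript must contain
    base_transcript: str      # output without keyterms or history
    context_transcript: str   # output with keyterms and prior turns supplied


@dataclass
class Comparison:
    total: int = 0
    base_hits: int = 0
    context_hits: int = 0
    recovered: list[str] = field(default_factory=list)  # missed in base, found with context
    regressed: list[str] = field(default_factory=list)  # found in base, missed with context

    @property
    def delta(self) -> float:
        """Change in entity accuracy (context minus base), as a fraction of all entities."""
        return (self.context_hits - self.base_hits) / self.total


def compare_runs(calls: list[Call]) -> Comparison:
    result = Comparison()
    for call in calls:
        base = _squash(call.base_transcript)
        ctx = _squash(call.context_transcript)
        for entity in call.entities:
            needle = _squash(entity)
            in_base, in_ctx = needle in base, needle in ctx
            result.total += 1
            result.base_hits += in_base
            result.context_hits += in_ctx
            if in_ctx and not in_base:
                result.recovered.append(entity)
            elif in_base and not in_ctx:
                result.regressed.append(entity)
    if result.total == 0:
        raise ValueError("no reference entities to score")
    return result


if __name__ == "__main__":
    calls = [
        Call(["RX-7704132"], "account are x 77 oh 4132", "account RX-7704132"),
        Call(["Premium Plan", "Manchester"],
             "premium plan in winchester", "Premium Plan in Manchester"),
    ]
    report = compare_runs(calls)
    print(f"delta: {report.delta:+.1%}, recovered: {report.recovered}, regressed: {report.regressed}")
```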
The ceiling is real. It's not inevitable.
The three problems above aren't speculative. They're the specific points where we've watched production voice agent systems run out of room: a user base that doesn't sound like the training data, an enterprise prospect whose security team asks the on-prem question, an audio environment that the pipeline wasn't designed for. Teams shipping meeting intelligence and AI notetakers hit the same walls in a slightly different order.
None of them require you to rebuild your stack. They require you to run the right tests, surface the right gaps, and make a clear-eyed decision about whether your current vendor can close them.
Here's where we'd start:
- On language and dialect coverage: Pull your worst-performing transcriptions. Isolate by language variant, domain vocabulary, and speaker accent. Run the same audio against AssemblyAI Universal-3 Pro and compare entity accuracy directly. The benchmark is available. The comparison is yours to run.
- On self-hosted deployment: If you're in or approaching enterprise sales, the question isn't whether you'll be asked about deployment options—it's when. Get ahead of it. AssemblyAI's self-hosted documentation is the place to start; reach out to the team for an architecture conversation if you're evaluating a specific compliance requirement.
- On context injection: Read the Universal-3 Pro documentation on keyterm injection and context chaining. Then run the before/after test on a sample of your own audio—specifically the transcriptions that currently produce bad LLM behavior downstream. The delta will tell you whether this moves the needle for your use case. Streaming speech-to-text pricing is on the pricing page if you want to size a rollout.
The production ceiling isn't a verdict on the choice you made. It's an invitation to make a more informed one.
FAQs
What is the "production ceiling" for voice agents?
The production ceiling is the point where the speech infrastructure decisions you made to ship quickly start constraining what your voice agent can become. It usually surfaces as three specific failures: poor accuracy on accents and domain terms, no path to self-hosted or BAA-backed deployment, and an STT layer that ignores available context. None of these are model bugs—they're architectural limits.
How do I measure voice agent transcription accuracy in production?
Run a structured evaluation on your real production audio, not a curated sample. Focus on entity miss rate—how often the model fails on proper nouns, alphanumeric strings, domain terminology, and accented speech—rather than overall word error rate, since entities are what your LLM needs to act correctly. Pull your worst-performing calls and benchmark them across vendors directly.
Can AssemblyAI be deployed self-hosted or on-premises?
Yes. Universal-3 Pro Streaming and the Voice Agent API stack can run in your cloud, on-prem, or in air-gapped infrastructure with full API parity to the hosted version. This unblocks data residency, regulated industry, and security-review requirements that disqualify cloud-only vendors. Talk to sales for an architecture conversation tied to your specific compliance posture.
Does AssemblyAI offer a BAA for healthcare voice agents?
AssemblyAI is considered a business associate under HIPAA and offers a standard Business Associate Addendum (BAA) for customers processing protected health information (PHI). The BAA is required under HIPAA for covered entities and their business associates to use AssemblyAI services with PHI. Contact sales to execute a BAA.
How does context injection improve voice agent accuracy?
Context injection feeds domain vocabulary, keyterms, and prior conversation turns directly into the streaming transcription layer—not just the LLM downstream. The model gets a prior for what it's likely to hear, so account numbers, product codes, and named entities resolve correctly instead of being guessed. In controlled tests, keyterm injection alone moves entity accuracy from the 70–75% range to above 90% on domain-specific inputs.
How does Universal-3 Pro Streaming compare to Deepgram Nova-3?
In independent evaluation on mixed-input conditions, Universal-3 Pro Streaming missed 16.7% of entities versus 25.5% for Deepgram Nova-3—a roughly 35% relative reduction in entity miss rate on the inputs that most often break voice agents. Universal-3 Pro Streaming also supports self-hosted deployment and native context chaining, capabilities that aren't broadly available from Deepgram for enterprise buyers.


