Roadmap
Pre-recorded Speech-to-Text API
Speech-to-text for pre-recorded audio.
Upcoming
-
The next Universal-3 Pro release. Native-language coverage grows from 6 to 19 languages, with accuracy gains on the six core languages.
A new model option priced the same as u3-pro, not an automatic drop-in. The current model stays available as a pinned snapshot.
New languages: Japanese, Vietnamese, Arabic, Dutch, Swedish, Hindi, Norwegian, Finnish, Danish, Urdu, Hebrew, Mandarin, Turkish.
-
Universal-3 Pro transcription roughly twice as fast end-to-end. Phase one is live and already cuts turnaround time by 30-80%.
-
Much better speaker labeling in noisy, multi-speaker audio. Targets the two most-reported errors: mislabeled short replies like “yeah” and “uh-huh”, and speaker turns that don’t line up with punctuation.
A new model option priced the same as u3-pro. The current model stays available as a pinned snapshot.
-
Open-source speech-to-text models served directly through our API, for the languages and domains they specialize in.
-
Recognize the same speaker across recordings, not just within a single file. For meetings, call centers, and cross-session analytics.
-
Synchronous HTTP transcription on u3-pro, ~134ms p50 latency, for voice-agent calls that need a transcript in a single request.
-
On-premise universal-3-pro for regulated environments with strict data-residency requirements.
-
The next major accuracy and capability release after Universal-3.5 Pro.
Recently shipped
-
Universal 3 Pro Async Timestamp Improvements: Better Universal-3 Pro timestamps, with median precision up 15.3% for English and 8.6% for non-English, and P99 gains of 15.0% and 58.4% respectively.
-
Hebrew & Swedish: Accuracy gains in Hebrew and Swedish via community models. Word error rates are down 37% and 47% respectively.
-
Medical Mode: LLM-powered correction for medical terminology, with a 4.97% error rate versus 7.32% for the next-best vendor. Add-on to Universal-3 Pro in English, Spanish, German, French, Portuguese, and Italian.
-
PII Audio Redaction using Silence: Redact PII with silence instead of a beep, reducing listener fatigue when redacted audio is replayed at scale.
-
Universal 3 Pro Async: Promptable speech-to-text with natural-language and custom-vocabulary prompts, mid-sentence language switching across six core languages, and audio tagging.
-
Improved Short-Audio Diarization: 19% better speaker-count accuracy and 6% lower speaker-attributed word error rate on audio under two minutes.
-
Multichannel Diarization: Per-channel speaker labels for multi-microphone recordings, eliminating crosstalk ambiguity in call-center and meeting audio.
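To make the silence-versus-beep distinction in the PII audio redaction item concrete, here is a minimal sketch of silence-based redaction on raw PCM samples. It is an illustration only, not AssemblyAI's implementation: the function name `redact_with_silence`, the `(start_ms, end_ms)` span format, and the plain list-of-integers audio representation are all assumptions.

```python
from typing import List, Sequence, Tuple


def redact_with_silence(
    samples: Sequence[int],
    sample_rate: int,
    spans_ms: Sequence[Tuple[int, int]],
) -> List[int]:
    """Return a copy of `samples` with each (start_ms, end_ms) span zeroed out."""
    redacted = list(samples)
    for start_ms, end_ms in spans_ms:
        # Convert millisecond offsets to sample indices, clamped to the buffer.
        start = max(0, start_ms * sample_rate // 1000)
        end = min(len(redacted), end_ms * sample_rate // 1000)
        for i in range(start, end):
            redacted[i] = 0  # digital silence instead of a beep tone
    return redacted


# One second of dummy 8 kHz audio; silence the span from 250 ms to 500 ms.
audio = [1000] * 8000
clean = redact_with_silence(audio, 8000, [(250, 500)])
```

In this sketch the redacted span keeps its original duration, so word timestamps elsewhere in the file would still line up with the audio.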
Realtime Speech-to-Text API
Low-latency streaming speech-to-text for live audio.
Upcoming
-
Our new flagship realtime model gives every turn the context a conversation actually has. The model takes direction from your agent, keeps a rolling memory of the call on its own, and hears the speaker instead of the room, so spelled-out emails, account IDs, and one-word confirmations come out right. It runs in 19 languages at flagship accuracy with mid-sentence code-switching, plus a new language_code parameter to commit to one language when you already know it.
A new model option priced the same as u3-pro, not an automatic drop-in. The current model stays available as a pinned snapshot.
New languages: Japanese, Vietnamese, Arabic, Dutch, Swedish, Hindi, Norwegian, Finnish, Danish, Urdu, Hebrew, Mandarin, Turkish.
-
A fast, cost-efficient realtime model for notetaking and meeting intelligence. Tuned for long-form audio where throughput and stable accuracy over multi-hour sessions matter more than latency.
-
On-premise universal-realtime-3-pro for regulated environments with strict data-residency requirements.
-
The next-generation realtime model for voice agents. Targets the lowest turn latency and the strongest handling of voice-agent audio (noise, interruptions, hesitation, accented speech), with instruction-following strong enough to replace today’s STT + LLM + TTS stack. Multilingual across 15+ native languages and the foundation for our speech-to-speech architecture.
-
A single native model that replaces today’s Voice Agent pipeline (STT, LLM, TTS) with a unified Realtime Speech LLM. Tighter latency, better prosody, and more natural interruption handling than orchestrated stacks.
Recently shipped
-
Voice Focus: Realtime noise suppression for voice agents and telephony, so accuracy holds up in real call-center conditions with no separate preprocessor.
-
Streaming Modes: min_latency, balanced, and max_accuracy presets to tune the latency/accuracy trade-off per workload.
-
Context Carryover: Universal-3 Pro Streaming carries prior finalized turns forward as context to improve accuracy, on by default. Optionally pass your voice agent’s spoken reply via agent_context so the model knows the question the user is answering.
-
Streaming Speaker Revision: An end-of-stream SpeakerRevision message returns corrected speaker labels at async-parity cpWER, for roughly 400ms of added latency.
-
Streaming PII Redaction: PII detection and redaction in the realtime pipeline for HIPAA, PCI, and similar workloads. Configurable entity types and substitution modes.
-
Medical Mode: LLM-powered correction for medical terminology, with a 4.97% error rate versus 7.32% for the next-best vendor. Async and streaming on universal-realtime-3-pro.
-
Streaming Diarization v1.5: Speaker-aware sentence splitting, with 4-5% lower word error rate, 56% fewer phantom speakers, and gains on the CallHome and AMI benchmarks.
-
Universal 3 Pro Realtime: Realtime speech-to-text with inline speaker labeling, custom vocabulary up to 1,000 words, audio tagging, filler-word control, mid-sentence language switching, and 99+ languages via Whisper routing. EU region support.
-
Edge Routing and Data Zone Endpoints: Global low-latency routing with US/EU data-residency endpoints. No additional charge.
Voice Agent API
End-to-end Voice Agent API.
Upcoming
-
Direct Twilio SIP and voice connectivity, without customer-side LiveKit plumbing.
-
Full programmatic control of voice agents at the edge. Create, version, and manage agents as code, persist and retrieve sessions (events, transcripts, tool calls), and run webhooks and tool calls at the edge instead of round-tripping to origin.
-
Official client libraries, starting with Python and TypeScript.
-
A custom turn-detection model trained on Universal 3 Pro streaming output. Reduces false endpointing and handles pauses, hesitations, and overlapping speech better.
Recently shipped
-
Voice Agent API: Production release of the Voice Agent API (formerly Speech-to-Speech API), built on universal-realtime-3-pro, LLM Gateway, and TTS on self-hosted LiveKit. PCI-certified.
-
Voice Agent Preview: First public release of end-to-end voice AI, combining universal-realtime-3-pro, LLM Gateway, and TTS on LiveKit.
TTS
Text-to-speech built for voice agents.
Upcoming
-
A standalone text-to-speech model for production voice workloads. Low time-to-first-byte, voice prompting, and accurate delivery of phone numbers, emails, and named entities that today’s TTS struggles with.
-
Open-source text-to-speech models served through our API alongside Universal TTS, for the languages, styles, and domains they specialize in.
Speech Understanding API
Extract meaning, sentiment, and events from audio.
Upcoming
-
Summarization via the LLM Gateway, replacing legacy LeMUR summaries, using frontier models with automatic fallbacks.
-
Sharper chapter boundaries and titles via the LLM Gateway, with better topic segmentation on long-form content.
-
Accuracy improvements to Speaker ID, Translation, and Custom Formatting. Translation covers both streaming and pre-recorded audio, for workflows where the spoken language differs from the output.
-
Static find-and-replace redaction (redact_pii_static_entities) for known terms, with LLM-powered category redaction following in Q3/Q4.
-
Automatic LLM-based transcript correction with no user prompts, generalizing the Medical Mode pattern to any domain.
-
Better keyterm prompting in every supported language via LLM Gateway post-processing, closing the gap with English for Spanish, German, French, Portuguese, and Italian.
-
Detect speaker emotions and shifts in input audio, distinct from TTS-side Emotion and Style Tagging. For therapy, CX scoring, and compliance monitoring.
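As a rough illustration of what static find-and-replace redaction of known terms involves, the sketch below masks a fixed term list in a transcript string. It runs locally and does not call the API: the roadmap names the redact_pii_static_entities parameter but not its request shape, so the helper `redact_static_terms`, its arguments, and the `[REDACTED]` mask are hypothetical.

```python
import re
from typing import Iterable


def redact_static_terms(text: str, terms: Iterable[str], mask: str = "[REDACTED]") -> str:
    """Replace every whole-word, case-insensitive occurrence of a known term."""
    # Longest terms first so "Acme Corp" wins over the shorter "Acme".
    ordered = sorted((t for t in terms if t), key=len, reverse=True)
    if not ordered:
        return text
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(t) for t in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    return pattern.sub(mask, text)


transcript = "Please bill Acme Corp, not acme, under Project Falcon."
redacted = redact_static_terms(transcript, ["Acme", "Acme Corp", "Project Falcon"])
```

Matching on a fixed list is predictable and cheap, but it only catches terms known in advance, which is presumably why category-level redaction is listed as a separate, later step.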
LLM Gateway
One API for every major LLM. Built-in fallbacks and audio-first integration.
Upcoming
-
Ongoing catalog expansion. The Gateway supports 24 models across Anthropic, OpenAI, Google, Qwen, and Kimi, with DeepSeek, Mistral, Llama, and Cohere next.
-
Priority, standard, and flex request tiers for per-request cost and latency control.
Recently shipped
-
Global Routing: An opt-in model_region: global setting that routes to lower-cost capacity for Claude calls. Gemini 3 series coming soon.
-
Claude Opus 4.8, Gemini 3.5 Flash, Gemini 3.1 Flash Lite (GA): Three new models added to the catalog, available through the Gateway on day one.
-
Reasoning Mode: Reasoning and effort controls exposed through the Gateway for OpenAI-compatible, Gemini 3+, and Anthropic models.
-
Prompt Caching: Prompt-cache pass-through, so customers keep the cache discount while routing through the Gateway.
-
Automatic Model Fallbacks: The Gateway retries failed requests against a configurable fallback model, so single-provider outages don’t surface as customer-facing failures.
-
Claude Opus 4.5 and 4.6: Anthropic’s most capable models, available through the Gateway on day one.
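The fallback behaviour described under Automatic Model Fallbacks follows a familiar pattern: try the primary model, and on failure retry the same request against the next configured model. The sketch below shows that pattern client-side, as an assumption-laden illustration rather than the Gateway's actual logic. `complete_with_fallback`, `fake_send`, and the model names are hypothetical, and the transport is a stub so the example runs without network access.

```python
from typing import Callable, List, Sequence


class AllModelsFailed(Exception):
    """Raised when the primary model and every fallback have failed."""


def complete_with_fallback(
    prompt: str,
    models: Sequence[str],
    send: Callable[[str, str], str],
) -> str:
    """Try each model in order and return the first successful response."""
    errors: List[str] = []
    for model in models:
        try:
            return send(model, prompt)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{model}: {exc}")
    raise AllModelsFailed("; ".join(errors))


def fake_send(model: str, prompt: str) -> str:
    # Stand-in transport: the "primary" provider is down, the fallback answers.
    if model == "primary-model":
        raise ConnectionError("provider outage")
    return f"[{model}] summary of: {prompt}"


answer = complete_with_fallback("call notes", ["primary-model", "fallback-model"], fake_send)
```

Handling this inside the Gateway, as the roadmap describes, means callers do not need to carry retry code like this themselves.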
Open Benchmarks
Transparent, reproducible benchmarks across Universal and community models.
Upcoming
-
A public version of our internal evaluation dashboard, covering 30+ competitors and the community models we serve across dozens of metrics. Verify accuracy and pick the right model per language and domain.
-
Open-source release of the dashboard and tooling, so anyone can reproduce our benchmarks on their own data.
-
A realistic Voice-AI evaluation set of telephony, meeting, and voice-agent audio with ground-truth transcripts, used to grade every model on the leaderboard. Not a thirty-second YouTube clip or LibriSpeech.
Developer Experience
The dashboard, accounts, and tooling that make AssemblyAI easy to adopt.
Upcoming
-
SAML and OIDC, as a fast-follow to multi-user accounts.
-
Guided setup for new accounts: model selection, API key creation, a first-request walkthrough, and best-practice defaults.
-
Hard spend caps and team budgets, beyond today’s soft alerts.
-
Deeper dashboard observability: P50 and P95 turnaround time, webhook delivery stats, uptime, and latency histograms.
-
Programmatic access to billing and usage data.
-
Configurable alert thresholds and cadences, including daily alerts and custom triggers.
-
Card support for EU PSD2 Strong Customer Authentication.
Recently shipped
-
Multi-User Accounts: Invite teammates with role-based access. GA, with RBAC, MFA enforcement, member management, account switching, and ownership transfer.
-
AssemblyAI Skill for AI Coding Agents: Claude Code, Cursor, and Codex now ship with a native AssemblyAI skill, giving them accurate API knowledge and cutting hallucinated API usage in generated code.