Voice Agent API

One API to build voice agents

Stream audio in, get audio back. We handle the rest so you can focus on your product.

Try the Voice Agent API live. This support agent is built on the Voice Agent API — the same one you can ship with. Click to start talking and experience real-time Voice AI in action. Ask about our products, APIs, or docs.

Please note: This agent provides customer support for AssemblyAI products only. Do not share sensitive or non-public information.

AssemblyAI Support Agent
Clinical history evaluation:
"prompt": "Produce a transcript for a clinical history evaluation. It's important to capture medication and dosage accurately. Every disfluency is meaningful data. Include: fillers (um, uh, er, erm, ah, hmm, mhm, like, you know, I mean), repetitions (I I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"
Without prompting

"I just want to move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I do. I take Ramipril. Okay. And I take Metformin, and there's another one that begins with G for the diabetes.  Glicoside."

With context aware prompting

"I just wanna move you along a bit further. Do you take any prescribed medicines? I know you've got diabetes and high blood pressure. I, I do. I take, um, I take Ramipril. Okay, mhm. And I take Metformin, and there's another one that begins with G for the diabetes. So glycosi — glycosi— glycoside."

Non-speech audio event:
"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: Tag sounds: [beep]"
Without audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options."

With audio tagging

"Your call has been forwarded to an automatic voice message system. At the tone, please record your message. When you have finished recording, you may hang up or press 1 for more options. [beep]"

Speech with disfluencies:
"prompt": "Produce a transcript suitable for conversational analysis. Every disfluency is meaningful data. Include: fillers (um, uh, er, ah, hmm, mhm, like, you know, I mean), repetitions (I I, the the), restarts (I was- I went), stutters (th-that, b-but, no-not), and informal speech (gonna, wanna, gotta)"
Without disfluency prompting

Do you and Quentin still socialize when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we're friends. What do you do with him?

With disfluency prompting

Do you and Quentin still socialize, uh, when you come to Los Angeles, or is it like he's so used to having you here? No, no, no, we, we, we're friends. What do you do with him?

Proper noun spelling:
"keyterms_prompt": ["Kelly Byrne-Donoghue"]
Without keyterms prompting

"Hi, this is Kelly Byrne Donahue"

With keyterms prompting

"Hi, this is Kelly Byrne-Donahue"

Capturing speaker roles:
"prompt": "Produce a transcript with every disfluency data. Additionally, label speakers with their respective roles. 1. Place [Speaker:role] at the start of each speaker turn. Example format: [Speaker:NURSE] Hello there. How can I help you today? [Speaker:PATIENT] I'm feeling unwell. I have a headache."}
With traditional speaker labels

Speaker A: 5mg. And do you take it regularly?

Speaker B: Oh yeah, yeah.

Speaker A: Good.

Speaker B: Every evening.

Speaker A: And no side effects with it?

With speaker labels prompting

Speaker [Nurse]: 5mg. And do you take it regularly?

Speaker [Patient]: Oh yeah, yeah.

Speaker [Nurse]: Good.

Speaker [Patient]: Every evening.

Speaker [Nurse]: And no side effects with it?

Spanish and English audio:
"language_detection": True
"prompt": Preserve natural code-switching between English and Spanish. Retain spokenlanguage as-is (correct "I was hablando con mi manager").
Without code-switching prompting

Would definitely think I spoke Spanish if you heard me speak Spanish. But I still make mistakes. Soy wines. Paltro Soy. La fundadora de goop. Thank you. Thank you for doing that.

With code-switching prompting

You would definitely think I spoke Spanish if you heard me speak Spanish, but I still make mistakes. Soy Gwyneth Paltrow, soy la fundadora de Goop. Thank you. Thank you for doing that.
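The `prompt`, `keyterms_prompt`, and `language_detection` fields shown in the examples above are ordinary request parameters. The sketch below assembles such a request body in Python; the endpoint path and exact payload shape are assumptions to verify against the current API reference.

```python
import json

# Hypothetical request-body builder using the parameters demonstrated
# above. Field names come from this page; the endpoint and payload
# shape should be checked against the API reference.
API_URL = "https://api.assemblyai.com/v2/transcript"  # assumed endpoint

def build_transcript_request(audio_url, prompt=None,
                             keyterms_prompt=None,
                             language_detection=False):
    """Assemble the JSON body for a transcription request."""
    body = {"audio_url": audio_url,
            "language_detection": language_detection}
    if prompt is not None:
        body["prompt"] = prompt
    if keyterms_prompt:
        body["keyterms_prompt"] = keyterms_prompt
    return body

body = build_transcript_request(
    "https://example.com/call-recording.mp3",
    prompt=("Produce a transcript suitable for conversational analysis. "
            "Every disfluency is meaningful data."),
    keyterms_prompt=["Kelly Byrne-Donoghue"],
)
print(json.dumps(body, indent=2))
```

In practice the request would carry your API key in an `Authorization` header, with the finished transcript retrieved by polling or webhook.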

Purpose-Built for Speech

The most accurate voice agent,
where it matters most

Universal-3 Pro Streaming gets the hard stuff right — emails, phone numbers, order IDs, names. The things that let your voice agent complete the tasks your customers need.


AssemblyAI Voice Agent API
Accurate transcript

Agent Thanks for calling the prescription refill line. This is Priya. Can I have your date of birth and the RX number on the bottle?

Caller Date of birth is 10/10/51, and the prescription is RX-7704132. It’s Metoprolol 80mg.

Agent Got it. I’m also seeing a standing order from Dr. Chen for epinephrine, 0.25 milligrams, 1:1,000 IM. Do you want that refilled too?

Caller Yes, please. And can you send it to the address on file, 10631 Northeast Knott Street, Portland, Oregon 97220?

Agent Confirmed. I’ll route it to your MyChart, and you’ll get a text at (971) 235-7292 when it’s all ready, around 08:30 tomorrow.

Deepgram Voice Agent API

Agent Thanks for calling the prescription refill line. This is Priya. Can I have your date of birth and the RX number on the bottle?

Caller Date of birth is 10/1051, and the prescription is dash seven seven zero four one three two. It’s metoprolol eighty milligrams.

Agent Got it. I’m also seeing a standing order from Dr. Chen for epinephrine point two five milligrams one to thousand I’m. Do you want that refilled too?

Caller Yes, please. And can you send it to the address on file, 10631 Northeast Knott Street, Portland, Oregon 97220?

Agent Confirmed. I’ll route it to your MyChart, and you’ll get a text at (971) 235-7292 when it’s all ready, around o 08:30 tomorrow.

Voice Experience

Conversations that flow naturally

A proprietary Voice AI stack built end-to-end for speech, so every layer is tuned for how people actually talk.

  • The most accurate voice agents on the market
    Powered by our proprietary Voice AI models like Universal-3 Pro Streaming
  • Clean interruption handling + turn detection
    Speech-aware VAD knows the difference between "I'm thinking" and "I'm done." Your agent stops cutting people off.
  • ~1 second response time
    Fast enough that the rhythm of conversation holds.
  • Built for what's coming
    Because we own the stack, improvements across speech understanding, reasoning, and voice generation ship as one product.
DEVELOPER EXPERIENCE

The fastest way from idea 
to working voice agent

One WebSocket. A handful of JSON types. Most developers ship the same day.

  • Standard JSON API
    No SDKs, no frameworks, no new billing dashboard. The same primitives you already know.
  • Live configuration updates
    Update system prompt, voice, tools, and VAD settings mid-conversation. Change anything and see it instantly.
  • Tool calling integrations
    Register any function with JSON Schema. The agent calls it when appropriate — look up an account, check an order, trigger a workflow.
  • Session resumption
    Reconnect within 30 seconds if the WebSocket drops. Context preserved, conversation continues.
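The pieces above are all plain JSON over one WebSocket. The sketch below shows what a session configuration and a JSON Schema tool registration could look like; the message type and field names are illustrative assumptions, not the documented wire format.

```python
import json

# Illustrative message builders. "session.update" and the field names
# are assumptions for this sketch, not the documented protocol.
def session_config(system_prompt, voice, vad_mode="speech_aware"):
    """One message configures prompt, voice, and turn detection;
    sending it again mid-conversation applies a live update."""
    return {
        "type": "session.update",
        "system_prompt": system_prompt,
        "voice": voice,
        "turn_detection": {"mode": vad_mode},
    }

def tool_definition(name, description, parameters):
    """Register a callable function; `parameters` is a JSON Schema
    describing the arguments the agent may pass."""
    return {
        "type": "function",
        "name": name,
        "description": description,
        "parameters": parameters,
    }

lookup_order = tool_definition(
    "lookup_order",
    "Fetch the status of an order by its ID.",
    {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
)
print(json.dumps(session_config("You are a support agent.", "priya")))
print(json.dumps(lookup_order))
```

Because tools are described with standard JSON Schema, the same definitions can be reused by whatever backend actually executes the function when the agent calls it.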
How we compare

See how AssemblyAI stacks up against the other voice agent API options.

|  | AssemblyAI Voice Agent API | OpenAI Realtime API | Deepgram Voice Agent API |
| --- | --- | --- | --- |
| Price | $4.50/hr | ~$18/hr | ~$4.50/hr |
| ASR model | Universal-3 Pro Streaming (#1 WER) | gpt-realtime | Deepgram Nova-3 |
| Alphanumeric accuracy (missed error rate) | 16.7% | 23.3% | 25.5% |
| Billing model | Flat hourly, no commitments | Per-token audio | Flat hourly; commitments required |
| Language support | EN, ES, FR, DE, IT, PT | 99+ languages (low accuracy) | EN, ES, NL, FR, DE, IT, JA |
| End-to-end latency | ~1 second | ~1 second | ~1–1.5 seconds |
| Turn detection | Speech-aware VAD | Basic | Basic |
| Turn detection tuning | Semantic + neural network + VAD | Semantic VAD and traditional VAD | Traditional VAD only |
| Mid-session updates | Prompt + voice + tools + turn detection | Prompt + tools only | Prompt + voice only |
| Session resumption | 30s reconnect window | Not listed | Not listed |
| Barge-in | Intelligent interruption | VAD-based interruptions | VAD-based interruptions |
| Tool calling behavior | Handles with intermediate speech | Goes silent | Goes silent |

Pricing estimates based on publicly available data as of 2026. Actual costs vary by usage pattern.

Live Demo

Don't take our word for it. Talk to it.

The best way to evaluate a voice agent platform is to have a conversation with one. Try the live demo — no signup required.

Powered by AssemblyAI Voice Agent API · Using Universal-3 Pro Streaming
Build Anything

Invisible infrastructure for your voice product

Your customers should feel like you built it. Full control over conversation design, tool integrations, and agent behavior.

Support

Customer Support

Agents that resolve tickets, look up accounts, and escalate intelligently.
Consumer

AI Companions

Conversational experiences that feel natural and remember context.
Healthcare

Clinical Workflows

Voice interfaces for intake, triage, and documentation — with accurate medical term recognition.
Education

Language Learning

Practice conversations in 6 languages with real-time feedback.
Telephony

Phone Agents

Voice agents for inbound and outbound calls. Works with phone-based and in-app experiences.
Training

Coaching & Training

Interactive voice sessions for sales training, onboarding, and skill development.

More on Voice Agents

Voice Agent Solutions

Create voice experiences that feel more intuitive and responsive while maintaining the flexibility to optimize for your unique requirements.

Learn more

Using an orchestrator?

Universal-3 Pro Streaming gives your voice agents the accuracy, speed, and real-time control to handle real conversations at scale.

Learn More

Start Building

Explore our comprehensive docs with integration guides and best practices to optimize accuracy and latency for your application.

Read the docs

Read the blog

Learn more about the Voice Agent API by reading our blog. Get the technical details and product thinking behind every AssemblyAI release.

Read the blog

Frequently Asked Questions

What is AssemblyAI's Voice Agent API and how does it work?

AssemblyAI's Voice Agent API is a single WebSocket API that handles the full voice agent pipeline — speech understanding, LLM reasoning, voice generation, turn detection, and interruption handling — so developers can stream audio in and get audio back without stitching together separate services. It's powered by Universal-3 Pro for industry-leading speech recognition accuracy and supports tool calling, live mid-conversation configuration updates, and 30-second session resumption. The API currently supports English, Spanish, French, German, Italian, and Portuguese.

How much does the Voice Agent API cost compared to OpenAI and Deepgram?

The Voice Agent API costs a flat $4.50 per hour, covering the entire speech-to-speech pipeline with no per-token surcharges or concurrency caps. The OpenAI Realtime API costs roughly $18 per hour and bills per token across 30+ event types, while Deepgram's voice agent offering is also $4.50 per hour but uses concurrency-metered billing. AssemblyAI's flat-rate model means predictable costs at any scale — from a single call to thousands of concurrent sessions.
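A back-of-the-envelope comparison using the hourly rates quoted above ($4.50/hr flat versus roughly $18/hr effective); the 1,000-hour volume is an arbitrary example, and actual per-token costs vary with usage pattern.

```python
def monthly_cost(rate_per_hour, hours_per_month):
    """Flat-rate cost estimate: hours times hourly rate."""
    return round(rate_per_hour * hours_per_month, 2)

hours = 1000  # e.g. a support line handling ~33 call-hours per day
print(monthly_cost(4.50, hours))   # flat hourly rate -> 4500.0
print(monthly_cost(18.00, hours))  # ~$18/hr effective per-token rate -> 18000.0
```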

Why does speech accuracy matter for voice agents?

Transcription errors cascade through every downstream step — if your agent mishears an order number, email address, or customer name, it can't complete the task. Universal-3 Pro gets the hard stuff right: mixed-entity tokens like confirmation codes, phone numbers, and proper nouns that voice agents need to act on. In head-to-head comparisons, it delivers 92.7% mixed-entity accuracy and handles natural pauses through semantic turn detection rather than basic silence-based VAD.

What can I build with the Voice Agent API?

Common use cases include customer support agents that look up orders and resolve issues by voice, AI companions, clinical workflow assistants (with Medical Mode for healthcare terminology accuracy), phone-based agents for appointment scheduling and lead qualification, language learning tools with real-time feedback, and coaching and training applications. The API's native tool calling lets your agent execute functions — lookups, payments, workflows — mid-conversation without breaking the dialogue flow.

How do I get started building with the Voice Agent API?

Sign up for a free account, grab an API key, and connect via WebSocket using standard JSON messages — no proprietary SDK required. Most developers ship a working demo same day. For a hands-on walkthrough, explore the developer docs. You can also talk to the live demo on the product page to experience the API's accuracy and latency firsthand.

Should I build with the Voice Agent API or AssemblyAI's Streaming Speech-to-Text?

The Voice Agent API is the fastest path to production—one integration, sub-second latency, no stitching required. But if you're already invested in an orchestrator like LiveKit or Pipecat, you can use Universal-Streaming as the STT layer in that stack. Both work. The API just gets you there faster.

Unlock the value of voice data

Build what’s next on the platform powering thousands of the industry’s leading Voice AI apps.