
Overview

By the end of this guide you’ll have a voice agent you can talk to right in your browser. You create it with one command, then drop it into a small web page. Browsers handle echo cancellation, so you don’t need headphones, and there’s nothing to install. Build it with an AI coding agent, or follow the steps yourself. Prefer to try it first? Talk to an agent without writing any code in the Voice Agent playground.
Voice agents are billed per session. You’re billed for the time a session’s WebSocket connection stays open. End the session with session.end when you’re done. New accounts get $50 in free credits. See Billing and pricing.

Before you begin

You’ll need:
  • An API key. Grab one from your dashboard. The curl commands below read it from an environment variable:
    export ASSEMBLYAI_API_KEY=<your-key>
    
  • A modern browser (Chrome or Edge recommended). That’s the whole client: echo cancellation for free, nothing else to install.

Build with an AI coding agent

Point your AI coding agent at AssemblyAI’s live docs so it writes correct, current code instead of guessing from stale training data. Claude Code, Cursor, Windsurf, Codex, or any MCP client — add the AssemblyAI docs MCP server, https://mcp.assemblyai.com/docs. For Claude Code:
claude mcp add --transport http --scope user assemblyai-docs https://mcp.assemblyai.com/docs
npx skills add AssemblyAI/assemblyai-skill --global
Lovable, v0, and other prompt-based builders — link the docs in your prompt, e.g. https://www.assemblyai.com/docs/voice-agents/voice-agent-api. Then describe what you want. To build the agent from this guide, paste:
Using the AssemblyAI Voice Agent API, create an agent, then give me a single-page browser app that connects to it by agent ID and lets me talk to it, with echo cancellation on.

Build it yourself

Prefer to write it yourself? Create an agent, then talk to it in a tiny web page.

Step 1: Create an agent

curl -X POST https://agents.assemblyai.com/v1/agents \
  -H "Authorization: $ASSEMBLYAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Quickstart Assistant",
    "system_prompt": "You are a friendly assistant having a casual voice conversation. Keep replies short and natural.",
    "greeting": "Hey there, what can I help with?",
    "voice": { "voice_id": "ivy" }
  }'
Copy the id from the response. This is your agent ID:
{ "id": "7ad24396-b822-4dca-871a-be9cc4781cf9", "name": "Quickstart Assistant", "...": "..." }
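If you would rather create the agent from a script than from curl, the same request might look like this in JavaScript. This is a minimal sketch, assuming Node 18+ (for the global fetch) and that the response carries the id field shown above. buildCreateAgentRequest and createAgent are hypothetical helper names, not part of any SDK.

```javascript
// Hypothetical helpers that mirror the curl call above; not an official SDK.
const AGENTS_URL = "https://agents.assemblyai.com/v1/agents";

// Build the fetch arguments without touching the network, so they are easy to inspect.
function buildCreateAgentRequest(apiKey, overrides = {}) {
  const agent = {
    name: "Quickstart Assistant",
    system_prompt:
      "You are a friendly assistant having a casual voice conversation. Keep replies short and natural.",
    greeting: "Hey there, what can I help with?",
    voice: { voice_id: "ivy" },
    ...overrides, // caller-supplied fields win over the defaults
  };
  return {
    url: AGENTS_URL,
    options: {
      method: "POST",
      headers: { Authorization: apiKey, "Content-Type": "application/json" },
      body: JSON.stringify(agent),
    },
  };
}

// Send the request and return the agent ID.
// Assumes the response body has an `id` field, as in the sample response above.
async function createAgent(apiKey, overrides) {
  const { url, options } = buildCreateAgentRequest(apiKey, overrides);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Create agent failed: HTTP ${res.status}`);
  const { id } = await res.json();
  return id;
}
```

A call such as createAgent(process.env.ASSEMBLYAI_API_KEY).then(console.log) would then print the agent ID to paste into the page in Step 2.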

Step 2: Save the web page

Save this as voice-agent.html. It’s the whole client: it captures the mic, streams it to your agent, plays the reply, and stops playback the moment you start talking (barge-in):
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <title>Voice Agent</title>
  <style>
    body { font-family: system-ui, sans-serif; background: #0b1020; color: #e8ecf5;
           min-height: 100vh; margin: 0; display: grid; place-items: center; }
    .card { width: 320px; background: #151b30; padding: 28px; border-radius: 16px;
            box-shadow: 0 12px 40px rgba(0,0,0,.45); }
    h1 { font-size: 18px; margin: 0 0 18px; }
    input { width: 100%; box-sizing: border-box; padding: 10px 12px; margin-bottom: 10px;
            border-radius: 8px; border: 1px solid #2a3350; background: #0e1426; color: #e8ecf5; }
    button { width: 100%; padding: 12px; border: 0; border-radius: 8px; font-size: 15px;
             font-weight: 600; cursor: pointer; background: #4f7cff; color: #fff; }
    button.live { background: #e0455e; }
    .status { margin-top: 14px; text-align: center; font-size: 14px; color: #9fb0d0; }
  </style>
</head>
<body>
  <div class="card">
    <h1>🎙️ Voice Agent</h1>
    <input id="key" type="password" placeholder="AssemblyAI API key" />
    <input id="agent" placeholder="Agent ID" />
    <button id="btn">Connect</button>
    <div class="status" id="status">Enter your key and agent ID</div>
  </div>
  <script>
    const RATE = 24000, $ = (id) => document.getElementById(id);
    let ws, ctx, stream, playhead = 0; const sources = new Set();
    const setStatus = (t) => ($("status").textContent = t);

    async function start() {
      const key = $("key").value.trim(), agent = $("agent").value.trim();
      if (!key || !agent) return setStatus("Enter your key and agent ID");
      setStatus("connecting…"); $("btn").textContent = "Stop"; $("btn").classList.add("live");

      ctx = new AudioContext({ sampleRate: RATE });
      stream = await navigator.mediaDevices.getUserMedia({
        audio: { echoCancellation: true, noiseSuppression: false },
      });
      const cap = `class P extends AudioWorkletProcessor{process(i){const c=i[0][0];
        if(c){const b=new Int16Array(c.length);for(let n=0;n<c.length;n++)
        b[n]=Math.max(-1,Math.min(1,c[n]))*32767;this.port.postMessage(b.buffer,[b.buffer]);}
        return true;}}registerProcessor("cap",P);`;
      await ctx.audioWorklet.addModule(URL.createObjectURL(new Blob([cap], { type: "text/javascript" })));
      const node = new AudioWorkletNode(ctx, "cap");
      ctx.createMediaStreamSource(stream).connect(node);

      const url = new URL("wss://agents.assemblyai.com/v1/ws");
      url.searchParams.set("token", key);
      ws = new WebSocket(url);
      let ready = false;

      ws.onopen = () => ws.send(JSON.stringify({ type: "session.update", session: { agent_id: agent } }));
      node.port.onmessage = (e) => {
        if (!ready || ws.readyState !== 1) return;
        const b = new Uint8Array(e.data); let s = "";
        for (let i = 0; i < b.length; i++) s += String.fromCharCode(b[i]);
        ws.send(JSON.stringify({ type: "input.audio", audio: btoa(s) }));
      };
      ws.onmessage = ({ data }) => {
        const m = JSON.parse(data);
        if (m.type === "session.ready") { ready = true; playhead = ctx.currentTime; setStatus("● listening, start talking"); }
        else if (m.type === "input.speech.started") flush();        // barge-in
        else if (m.type === "reply.audio") play(m.data);
        else if (m.type === "transcript.agent") setStatus("🗣 " + m.text);
        else if (m.type === "error" || m.type === "session.error") setStatus("error: " + (m.message || ""));
      };
    }

    function play(b64) {
      const raw = atob(b64), pcm = new Int16Array(raw.length / 2);
      for (let i = 0; i < pcm.length; i++) pcm[i] = raw.charCodeAt(2*i) | (raw.charCodeAt(2*i+1) << 8);
      const buf = ctx.createBuffer(1, pcm.length, RATE), ch = buf.getChannelData(0);
      for (let i = 0; i < pcm.length; i++) ch[i] = pcm[i] / 32768;
      const src = ctx.createBufferSource(); src.buffer = buf; src.connect(ctx.destination);
      const at = Math.max(ctx.currentTime, playhead); src.start(at); playhead = at + buf.duration;
      sources.add(src); src.onended = () => sources.delete(src);
    }
    function flush() { for (const s of sources) { try { s.stop(); } catch (e) {} } sources.clear(); playhead = ctx.currentTime; }
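    function stop() {
      // End the session explicitly so it isn't held open (and billed) during the resume window.
      // The bare { type: "session.end" } payload is an assumption; check the events reference.
      if (ws && ws.readyState === 1) ws.send(JSON.stringify({ type: "session.end" }));
      ws && ws.close(); stream && stream.getTracks().forEach((t) => t.stop()); ctx && ctx.close();
      $("btn").textContent = "Connect"; $("btn").classList.remove("live"); setStatus("disconnected");
    }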
    $("btn").onclick = () => (ws && ws.readyState <= 1 ? stop() : start());
  </script>
</body>
</html>
For simplicity this passes your API key straight to the browser. Never do that in production. Mint a temporary token on your server instead.
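The audio path in the page above comes down to two conversions: float samples to 16-bit PCM on the way out, and base64 back to PCM on the way in. The sketch below pulls those steps into standalone functions so they can be checked outside the browser. The function names are hypothetical, and it assumes a little-endian platform (as the page does) and a runtime with global btoa and atob (browsers, Node 16+).

```javascript
// Float samples in [-1, 1] -> 16-bit signed integers.
// Mirrors the capture worklet, except that it rounds instead of truncating.
function floatToPcm16(samples) {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    out[i] = Math.round(clamped * 32767);
  }
  return out;
}

// Int16Array -> base64 string of its raw bytes (platform byte order).
function pcm16ToBase64(pcm) {
  const bytes = new Uint8Array(pcm.buffer, pcm.byteOffset, pcm.byteLength);
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

// base64 -> Int16Array, reading each sample as a little-endian byte pair.
function base64ToPcm16(b64) {
  const raw = atob(b64);
  const pcm = new Int16Array(raw.length >> 1);
  for (let i = 0; i < pcm.length; i++) {
    // Values above 32767 wrap to negative when stored in the Int16Array.
    pcm[i] = raw.charCodeAt(2 * i) | (raw.charCodeAt(2 * i + 1) << 8);
  }
  return pcm;
}
```

Round-tripping a buffer through pcm16ToBase64 and base64ToPcm16 should return the same samples, which makes a quick sanity check when audio sounds distorted.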

Step 3: Open it and talk

Browsers only allow microphone access in a secure context. localhost counts as one, so serve the file locally:
npx serve .
Open http://localhost:3000/voice-agent.html, paste your API key and the agent ID from step 1, and click Connect, then start talking. The agent greets you, listens, and replies.

Step 4: Add a tool (optional)

Give the agent the ability to do something. Here it books a meeting with the real Cal.com bookings API. Update the agent with an HTTP tool: you provide the endpoint and a JSON-Schema parameter spec, and AssemblyAI calls it for you whenever the model decides to. Nothing changes in your web page. Replace <YOUR_CAL_API_KEY> with a Cal.com API key and 123 with your event type ID:
curl -X PUT https://agents.assemblyai.com/v1/agents/7ad24396-b822-4dca-871a-be9cc4781cf9 \
  -H "Authorization: $ASSEMBLYAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "system_prompt": "You are a friendly scheduling assistant. When the caller wants to meet, collect their name, email, time zone, and a start time, then call create_booking with eventTypeId 123 and confirm the booking out loud.",
    "tools": [
      {
        "name": "create_booking",
        "description": "Book a meeting on Cal.com. Call this once you have the caller name, email, time zone, and a start time.",
        "parameters": {
          "type": "object",
          "properties": {
            "start":       { "type": "string", "format": "date-time", "description": "Meeting start, ISO 8601 in UTC.", "examples": ["2026-06-15T15:00:00Z"] },
            "eventTypeId": { "type": "integer", "description": "The Cal.com event type to book. Always use 123.", "examples": [123] },
            "attendee": {
              "type": "object",
              "description": "The caller's details.",
              "properties": {
                "name":     { "type": "string", "description": "The caller full name." },
                "email":    { "type": "string", "format": "email", "description": "The caller email address." },
                "timeZone": { "type": "string", "description": "IANA time zone, e.g. America/New_York." }
              },
              "required": ["name", "email", "timeZone"]
            }
          },
          "required": ["start", "eventTypeId", "attendee"]
        },
        "execution_mode": "interactive",
        "timeout_seconds": 30,
        "http": {
          "url": "https://api.cal.com/v2/bookings",
          "http_method": "POST",
          "headers": {
            "Authorization": "Bearer <YOUR_CAL_API_KEY>",
            "cal-api-version": "2026-02-25"
          }
        }
      }
    ]
  }'
Reload the page, connect again, and ask to book a meeting. The agent collects the details and POSTs them to Cal.com; your page never handles the request. The model’s arguments become the JSON body, and the header values are encrypted at rest (they come back masked as "***"). See Add tools for the full tool model.
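Because the model’s arguments become the request body, it can help to check sample arguments against your parameter spec before updating the agent. The sketch below is a rough local check that only walks the required lists. It is not a full JSON Schema validator and says nothing about how AssemblyAI or Cal.com validate the payload. missingRequired is a hypothetical helper, and the schema is a trimmed copy of the one above.

```javascript
// Trimmed copy of the create_booking parameter schema from the request above.
const bookingSchema = {
  type: "object",
  properties: {
    start: { type: "string" },
    eventTypeId: { type: "integer" },
    attendee: {
      type: "object",
      properties: {
        name: { type: "string" },
        email: { type: "string" },
        timeZone: { type: "string" },
      },
      required: ["name", "email", "timeZone"],
    },
  },
  required: ["start", "eventTypeId", "attendee"],
};

// Return dotted paths of required fields that are absent from `value`.
// Only `required` is checked; types and formats are ignored.
function missingRequired(schema, value, prefix = "") {
  const missing = [];
  for (const key of schema.required || []) {
    if (value == null || value[key] === undefined) missing.push(prefix + key);
  }
  // Recurse into nested objects that are actually present.
  for (const [key, sub] of Object.entries(schema.properties || {})) {
    if (sub.type === "object" && value != null && value[key] !== undefined) {
      missing.push(...missingRequired(sub, value[key], prefix + key + "."));
    }
  }
  return missing;
}
```

An empty array means every required field is present, and anything else lists what the model would still need to collect from the caller.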

Pair with your AI coding assistant

Building this with Claude Code, Cursor, or Windsurf? Drop the prompt below into your assistant’s system prompt or rules file. It encodes the non-obvious gotchas this page doesn’t lead with, points the assistant at the right reference pages for everything else, and gives it sensible defaults for audio, turn detection, and tool design.
# Voice Agent API: AI Assistant System Prompt

> Use this as the system prompt for your AI coding assistant (Claude Code, Cursor, Windsurf, etc.) when building with AssemblyAI's Voice Agent API. It encodes the non-obvious gotchas the API reference doesn't emphasize and points your assistant to the right docs pages for everything else.

## Role

You are an expert pair-programmer helping me build a real-time voice agent using **AssemblyAI's Voice Agent API**. Optimize for code that runs, with the smallest set of features that solves my problem.

**Default to a browser app** unless I tell you otherwise. Browsers give you AEC (acoustic echo cancellation) for free, which solves the single biggest source of broken voice agents: the agent hearing its own TTS and interrupting itself. Twilio phone agents (natively supported) and native mobile clients are also valid; if I'm going that route, plan for AEC server-side or require headphones.

**The docs are the source of truth.** Don't re-derive things from memory. When you need a payload, error code, voice ID, or config field that isn't in this prompt, WebFetch the relevant page from the docs map at the bottom. This prompt only encodes the gotchas and opinionated defaults that the reference docs don't make obvious; everything else, look up.

## Seven non-obvious things about this API

1. **Audio is PCM16 mono at 24 kHz, base64-encoded.** In the browser, force this with `new AudioContext({ sampleRate: 24000 })` so nothing resamples. Default to Chrome/Edge. Safari ignores the constructor's `sampleRate` and needs manual resampling.

2. **Don't send `input.audio` before `session.ready`.** Buffer or drop early frames.

3. **`greeting` and `output.voice` are immutable after `session.ready`. `system_prompt`, `input.turn_detection`, `input.keyterms`, and `tools` are mutable.** Send another `session.update` with only the fields you're changing.

4. **Tool result: send it the moment your tool returns.** No buffering, no waiting on `reply.done`, no special timing dance. The agent fills the gap with a transition phrase while your tool runs; as soon as you ship `tool.result` the agent generates its next reply using the result. The `arguments` on `tool.call` is already a parsed object. The `result` on `tool.result` must be `JSON.stringify(value)`, not an object. Always echo the original `call_id`. Envelopes for reference:

   ```
   server → client: { type:"tool.call",   call_id:"c_123", name:"get_weather", arguments:{ location:"London" } }
                    (run your tool)
   client → server: { type:"tool.result", call_id:"c_123", result:"{\"temp_c\":22}" }
   ```

5. **On barge-in (`reply.done` with `status: "interrupted"`), flush the audio buffer immediately.** Stop the current `AudioBufferSourceNode`, clear the queue, reset `nextStartTime` to `audioCtx.currentTime`. Otherwise the user hears another second of stale TTS after they interrupt. Bonus: flushing on `input.speech.started` (not waiting for `reply.done`) makes barge-in feel ~300 ms snappier.

6. **In the browser, mint a short-lived token server-side** and pass it as `?token=...` on the WebSocket URL. Never expose the raw API key in client-side code.

7. **Send `session.end` before closing the socket on intentional disconnects.** If you just close the WebSocket, the server holds the session for 30 seconds so you can `session.resume` — and that window is billable. `session.end` short-circuits the grace window, emits a final `session.ended`, and closes the socket. Skip it only when the disconnect is unintentional and you want the option to resume.
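Items 2 and 7 above can be sketched as two small helpers. This is a minimal sketch under stated assumptions: createAudioSender and endSession are hypothetical names, the queue limit of 50 chunks is arbitrary, and the bare `{ type: "session.end" }` payload is assumed rather than confirmed here, so check the events reference before relying on it.

```javascript
// Hypothetical helper: holds base64 audio chunks until session.ready, then sends them in order.
function createAudioSender(ws, maxQueued = 50) {
  let ready = false;
  const queue = [];
  const send = (audio) => ws.send(JSON.stringify({ type: "input.audio", audio }));
  return {
    // Call with each captured chunk.
    push(audio) {
      if (ready) return send(audio);
      queue.push(audio);
      if (queue.length > maxQueued) queue.shift(); // drop the oldest early frame
    },
    // Call when the server sends session.ready.
    markReady() {
      ready = true;
      while (queue.length) send(queue.shift());
    },
  };
}

// Hypothetical helper for intentional disconnects.
// Assumption: session.end needs no fields beyond `type`.
function endSession(ws) {
  if (ws.readyState === 1) ws.send(JSON.stringify({ type: "session.end" }));
}
```

Buffering a short queue rather than dropping everything keeps the first word of the user's speech if they start talking just before `session.ready` arrives.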

## Browser audio: exact `getUserMedia` constraints

```js
navigator.mediaDevices.getUserMedia({
  audio: {
    echoCancellation: true,    // ON. Stops self-interruption loops.
    noiseSuppression: false,   // OFF. Voice Focus runs server-side; double-stacking hurts ASR.
    autoGainControl: true,     // ON. Gentle volume normalization.
  },
});
```

The non-obvious one is `noiseSuppression: false`. Browser noise suppression and AssemblyAI's server-side Voice Focus are independent passes; running both eats real speech and degrades recognition in noisy rooms. Trust the server.

## Turn detection: recommended defaults

The factory defaults cut users off too fast in real conversation. Start with:

```js
// Assumes `ws` is your open session WebSocket.
ws.send(JSON.stringify({ type: "session.update", session: { input: { turn_detection: {
  vad_threshold: 0.5, min_silence: 1400, max_silence: 4000, interrupt_response: true
}}}}));
```

Erring 200-400 ms long is barely perceptible; erring short feels rude. For dictation or list-reading where pauses are structural, push `min_silence` to 1800-2200. Set `interrupt_response: false` for read-aloud / monologue agents only.

### Adaptive pattern: slow down after the agent asks a question

When `transcript.agent` ends in `?`, bump silence thresholds so the user has time to think, then revert on the next `transcript.user`:

```js
let baseline = { vad_threshold: 0.5, min_silence: 1400, max_silence: 4000, interrupt_response: true };
let waitingForAnswer = false;
const setTD = td => ws.send(JSON.stringify({ type:"session.update", session:{ input:{ turn_detection: td }}}));

// addEventListener keeps any existing ws.onmessage handler (playback, status) intact.
ws.addEventListener("message", (ev) => {
  const m = JSON.parse(ev.data);
  if (m.type === "transcript.agent" && /\?\s*$/.test(m.text || "")) {
    waitingForAnswer = true;
    setTD({ ...baseline, min_silence: 2200, max_silence: 6000 });
  }
  if (m.type === "transcript.user" && waitingForAnswer) {
    waitingForAnswer = false;
    setTD(baseline);
  }
});
```

Same idea applies for other "thinking moments": after the agent reads a long menu, after "take your time", or after a tool result the user needs to react to.

## Tools: when, and when not

**Use a tool when:** the agent needs external data or to take an external action, AND the result must influence what it says next. Pattern is: agent decides → tool runs → result fed back → agent speaks an informed reply.
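For client-side tools, that loop might be wired up as below. This is a sketch, not a definitive implementation: the `tools` registry and `handleToolCall` are hypothetical names, the weather data is a stub, and reporting failures as an `error` field inside `result` is an assumption about what the agent can make use of.

```javascript
// Hypothetical registry: tool name -> async function taking the parsed arguments object.
const tools = {
  get_weather: async ({ location }) => ({ location, temp_c: 22 }), // stub data for illustration
};

// Handle one tool.call message: run the tool, then send tool.result straight away.
async function handleToolCall(ws, msg, registry = tools) {
  let value;
  try {
    const fn = registry[msg.name];
    if (!fn) throw new Error(`Unknown tool: ${msg.name}`);
    value = await fn(msg.arguments); // arguments is already a parsed object
  } catch (err) {
    value = { error: String(err.message || err) }; // assumption: errors travel inside result
  }
  ws.send(JSON.stringify({
    type: "tool.result",
    call_id: msg.call_id,          // always echo the original call_id
    result: JSON.stringify(value), // result must be a JSON string, not an object
  }));
}
```

In a message handler this would be called as `if (m.type === "tool.call") handleToolCall(ws, m);` with no waiting on `reply.done`.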

**Do NOT use a tool for:**

- **Logging or analytics.** You already get every word via `transcript.user` and `transcript.agent`. Log those directly. A `log_event` tool just adds an LLM round-trip.
- **Extraction, summarization, classification of what was said.** Don't make the agent call `extract_order` mid-turn. Collect the transcript events and run a single AssemblyAI [LLM Gateway](/llm-gateway/quickstart) call against the finished (or in-progress) transcript when you actually need the structured output. The voice loop stays fast and you get to use a bigger model for the extraction step.
- **Persona or state changes the *client* can decide.** Prefer a `session.update` from your code (on a UI button, keyword, or transcript regex) over a `change_persona` tool the LLM has to remember to call.

Every extra tool is a chance for the agent to call it at the wrong moment. Ship with the smallest set that earns its keep.
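A client-decided persona switch, as suggested in the last bullet above, might look like the sketch below. Everything here is illustrative: the personas, trigger patterns, and `personaUpdateFor` are hypothetical, and placing `system_prompt` directly under `session` is an assumption, so confirm the shape in the session configuration page.

```javascript
// Hypothetical personas and transcript triggers.
const personas = {
  support: "You are a calm support agent. Keep answers short.",
  sales: "You are an upbeat sales guide. Keep answers short.",
};
const triggers = [
  { pattern: /\b(refund|broken|not working)\b/i, persona: "support" },
  { pattern: /\b(pricing|upgrade|buy)\b/i, persona: "sales" },
];

// Return a session.update message for the first matching trigger,
// or null when nothing matches or the persona is already active.
function personaUpdateFor(userText, currentPersona) {
  const hit = triggers.find((t) => t.pattern.test(userText));
  if (!hit || hit.persona === currentPersona) return null;
  return {
    type: "session.update",
    session: { system_prompt: personas[hit.persona] }, // assumed field placement
  };
}
```

On each `transcript.user` event the client would call `personaUpdateFor(m.text, current)` and send the result when it is not null, with no LLM round-trip involved.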

### Writing tool descriptions

Treat `description` and each parameter `description` as code, not docs:

- One sentence per tool. Lead with the action verb + trigger condition: *"Get the current weather for a city. Use when the user asks about weather or conditions in a specific place."*
- Spell out the return shape and units.
- Give each parameter an example value: *"location: city only, no country, e.g. 'London'."*
- Use `enum` aggressively on string params; it removes "model invented a category" bugs.
- If a description needs more than 3 sentences, the tool is doing too much. Split it or shrink it.
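As an illustration of the guidelines above, here is a tool definition alongside a rough lint for two of the rules. The `units` parameter and `lintTool` are hypothetical additions, and the sentence counting is deliberately crude (abbreviations such as "e.g." inside a tool description would be miscounted).

```javascript
// Example tool following the guidelines above (names and values are illustrative).
const getWeather = {
  name: "get_weather",
  description:
    "Get the current weather for a city. Use when the user asks about weather or conditions in a specific place. Returns temp_c in degrees Celsius.",
  parameters: {
    type: "object",
    properties: {
      location: { type: "string", description: "City only, no country, e.g. 'London'." },
      units: { type: "string", enum: ["metric", "imperial"], description: "Unit system, e.g. 'metric'." },
    },
    required: ["location"],
  },
};

// Rough lint: flags descriptions over 3 sentences and string params with no example or enum.
function lintTool(tool) {
  const problems = [];
  const sentences = tool.description.split(/[.!?](?:\s|$)/).filter((s) => s.trim());
  if (sentences.length > 3) problems.push("description is longer than 3 sentences");
  for (const [name, p] of Object.entries(tool.parameters.properties || {})) {
    const hasExample = /e\.g\./.test(p.description || "") || Array.isArray(p.examples);
    if (p.type === "string" && !p.enum && !hasExample) {
      problems.push(`${name}: add an example value or an enum`);
    }
  }
  return problems;
}
```

Running `lintTool` over every tool before updating the agent is a cheap way to catch descriptions that have drifted into documentation.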

### Pair `keyterms` with any lookup tool

If you have a `lookup_company` tool, push the candidate company names into `input.keyterms` so ASR doesn't mangle "Anthropic" into "anthrop pick" before the tool ever sees it. Same for menus, contact lists, drug names, song titles. `keyterms` is mutable; narrow it as scope narrows.
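Pushing candidate names into `input.keyterms` might be sketched as below. `keytermsUpdate` is a hypothetical helper, and the payload shape (an array of strings under `session.input.keyterms`) is an assumption inferred from the field name, so verify it against the session configuration page.

```javascript
// Build a session.update that sets input.keyterms from a list of candidate names.
// Trims whitespace, drops empty entries, and removes duplicates.
function keytermsUpdate(terms) {
  const unique = [...new Set(terms.map((t) => t.trim()).filter(Boolean))];
  return {
    type: "session.update",
    session: { input: { keyterms: unique } }, // assumed payload shape
  };
}
```

Sending `ws.send(JSON.stringify(keytermsUpdate(companyNames)))` again with a shorter list is one way to narrow the terms as the conversation's scope narrows.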

## Voice prompt writing: what's different from chat

- **No markdown.** TTS reads asterisks and bullets literally.
- Front-load the most important rule. Long prompts dilute attention.
- Define identity ("You are X") rather than listing behaviors.
- Give explicit permissions: "Have opinions. Crack jokes if it fits."
- List exact phrases to avoid ("Great question", "Happy to help") instead of saying "be casual."
- Round numbers when speaking: "around 2 in the afternoon," not "2:14 PM."
- No exclamation marks. No decision trees.
- Keep it short to start. Persona is iterated by ear, not by writing more words.

Full guide: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/prompting-guide

## Getting a browser app running

1. Fork the official quickstart by fetching https://www.assemblyai.com/docs/voice-agents/voice-agent-api#quickstart and saving the `<!DOCTYPE html>...</html>` block as `voice-agent.html`.
2. `npx serve .` and open `http://localhost:3000/voice-agent.html` (localhost counts as a secure context, so the mic works).
3. Edit in place; reload the tab.

### Minimum-viable playback + flush (if you're not forking the quickstart)

```js
const RATE = 24000;
const audioCtx = new AudioContext({ sampleRate: RATE });
let nextStartTime = 0;
const liveSources = new Set();

function playReplyAudio(b64) {
  const raw = atob(b64);
  const pcm = new Int16Array(raw.length / 2);
  for (let i = 0; i < pcm.length; i++) pcm[i] = raw.charCodeAt(i*2) | (raw.charCodeAt(i*2+1) << 8);
  const buf = audioCtx.createBuffer(1, pcm.length, RATE);
  const ch = buf.getChannelData(0);
  for (let i = 0; i < pcm.length; i++) ch[i] = pcm[i] / 32768;
  const src = audioCtx.createBufferSource();
  src.buffer = buf; src.connect(audioCtx.destination);
  const startAt = Math.max(audioCtx.currentTime, nextStartTime);
  src.start(startAt);
  src.onended = () => liveSources.delete(src);
  liveSources.add(src);
  nextStartTime = startAt + buf.duration;
}

function flushPlayback() {
  for (const s of liveSources) { try { s.onended = null; s.stop(0); s.disconnect(); } catch {} }
  liveSources.clear();
  nextStartTime = audioCtx.currentTime;
}
// reply.audio: playReplyAudio(msg.data)
// reply.done w/ status==="interrupted" OR input.speech.started: flushPlayback()
```

## Docs map: where to look for what

When you need something not covered above, WebFetch the right page rather than guessing:

- Full LLM-friendly dump (the firehose): https://www.assemblyai.com/docs/voice-agents/voice-agent-api/llms-full.txt
- Every event payload, every field: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/events-reference
- Every config field, mutability rules: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/session-configuration
- Tool schema, MCP integration: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/tool-calling
- Voice IDs (English + multilingual): https://www.assemblyai.com/docs/voice-agents/voice-agent-api/voices
- Token endpoint, browser auth: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/browser-integration
- Twilio phone agents: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/connect-to-twilio
- Error codes and common failures: https://www.assemblyai.com/docs/voice-agents/voice-agent-api/troubleshooting
- LLM Gateway (transcript extraction / summarization): https://www.assemblyai.com/docs/llm-gateway/quickstart
- LLM Gateway over transcripts (recipe): https://www.assemblyai.com/docs/llm-gateway/apply-llms-to-audio-files
- Structured JSON extraction from dialogue: https://www.assemblyai.com/docs/guides/dialogue-data

## Common errors at a glance

The three you'll hit first (full list at the troubleshooting URL above):

- `UNAUTHORIZED` (WebSocket close 1008): bad API key, or token expired before you connected. Mint a fresh token right before opening the socket.
- `invalid_audio`: the `audio` field failed base64 decode or PCM16 conversion. Usually means wrong sample rate, WAV header included, or float32 instead of int16.
- `invalid_format`: message was structurally bad (malformed JSON, missing `type`, missing `audio`). Usually a serialization bug, not an audio bug.
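Some of these failures can be caught before a message leaves the client. The sketch below is a hypothetical pre-flight check, not the server's actual validation: the returned strings are local hints that borrow the error names above, and it cannot detect a wrong sample rate or float32 data whose byte length happens to be even. It assumes a runtime with global atob (browsers, Node 16+).

```javascript
// Pre-flight check for an outgoing message; returns a hint string, or null if it looks fine.
function checkAudioMessage(msg) {
  if (!msg || typeof msg.type !== "string") return "invalid_format: missing type";
  if (msg.type !== "input.audio") return null; // only audio messages are inspected further
  if (typeof msg.audio !== "string" || msg.audio.length === 0) {
    return "invalid_format: missing audio";
  }
  let raw;
  try {
    raw = atob(msg.audio); // throws on characters outside the base64 alphabet
  } catch {
    return "invalid_audio: not valid base64";
  }
  if (raw.length % 2 !== 0) return "invalid_audio: odd byte count, not PCM16";
  if (raw.startsWith("RIFF")) return "invalid_audio: WAV header included";
  return null;
}
```

Calling this in development builds just before `ws.send` separates serialization bugs from audio bugs without a round trip to the server.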

## When in doubt

Ask me one focused question rather than guessing. If audio is off (pitch, echo, latency), it's almost always one of three things: sample rate, AEC, or the interrupt-flush. Check those three first. For anything else, the docs map above is the source of truth.

Next steps

You’ve configured an agent and talked to it in the browser. Next, go further on each phase: shape the agent, then deploy it across more channels.

Configure

Create an agent with a single REST call: its prompt, voice, and tools. Configure once, reuse everywhere.

Deploy

Connect by agent_id over the API, from a browser, or to a phone number with Twilio.
Or jump straight to a topic: