Voice Agent API

Browser integration

Connect browser-based apps to the Voice Agent API using a temporary token.

Connect a browser to the Voice Agent API in two steps:

  1. Your server calls GET /v1/token with your API key to mint a short-lived temporary token.
  2. Your browser opens the WebSocket with ?token=<token>; no API key is exposed to the client.

Your API key never leaves your server. Each token is single-use: it starts exactly one session, and all usage is attributed to the key that generated it.

Browsers provide built-in acoustic echo cancellation through getUserMedia, so browser-based clients work hands-free without headphones. If you’re developing on a laptop, the browser integration is the recommended starting point.

1. Generate a token on your server

Call GET /v1/token with your API key in the Authorization header. Pick an expires_in_seconds short enough to limit replay risk (60–300s is a good default) and an optional max_session_duration_seconds to cap the session length.

These two parameters control different things and are easy to confuse:

  • expires_in_seconds is the token redemption window — how long the client has to use this token to open a WebSocket. If the window elapses before the WebSocket is opened, the server returns a session.error with code unauthorized on the first frame instead of session.ready. Once a session.ready has been received, this value no longer applies.
  • max_session_duration_seconds is the session duration cap — how long the resulting voice agent session is allowed to run after the WebSocket is open.
GET
/v1/token
curl -G https://agents.assemblyai.com/v1/token \
  -H "Authorization: <apiKey>" \
  -d expires_in_seconds=300
// server/routes/voice-token.js
import express from "express";

const router = express.Router();

router.get("/voice-token", async (_req, res) => {
  const url = new URL("https://agents.assemblyai.com/v1/token");
  url.searchParams.set("expires_in_seconds", "300");
  url.searchParams.set("max_session_duration_seconds", "8640");

  const response = await fetch(url, {
    headers: { Authorization: `Bearer ${process.env.ASSEMBLYAI_API_KEY}` },
  });

  if (!response.ok) {
    return res.status(response.status).send(await response.text());
  }

  const { token } = await response.json();
  res.json({ token });
});

export default router;

expires_in_seconds must be between 1 and 600. max_session_duration_seconds must be between 60 and 10800 (defaults to 10800, the 3-hour maximum session duration).
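These ranges can also be enforced client-side before the request goes out, so out-of-range values fail predictably on your server instead of at the API. A minimal sketch (the helper name and option names are illustrative):

```javascript
// Sketch: clamp token parameters to the documented ranges.
// expires_in_seconds: 1-600, max_session_duration_seconds: 60-10800.
function clampTokenParams({ expiresIn = 300, maxSession = 10800 } = {}) {
  return {
    expires_in_seconds: Math.min(600, Math.max(1, expiresIn)),
    max_session_duration_seconds: Math.min(10800, Math.max(60, maxSession)),
  };
}
```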

Session end at max_session_duration_seconds

When the session reaches its server-side duration limit, the WebSocket closes. There is no separate “closing soon” warning event before this — if you need to finalize gracefully (e.g. play a wrap-up message, save state), run a client-side timer using the value you passed for max_session_duration_seconds and start your wrap-up a few seconds before it elapses.
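One way to sketch that timer (helper names are illustrative, and the 5-second grace period is an arbitrary default):

```javascript
// Compute how long to wait before starting the client-side wrap-up,
// leaving graceSeconds of headroom before the server closes the socket.
function wrapUpDelayMs(maxSessionSeconds, graceSeconds = 5) {
  // Never return a negative delay for very short session caps.
  return Math.max(0, (maxSessionSeconds - graceSeconds) * 1000);
}

// Start the timer when session.ready arrives, using the same value you
// passed as max_session_duration_seconds when minting the token.
function scheduleWrapUp(maxSessionSeconds, onWrapUp) {
  return setTimeout(onWrapUp, wrapUpDelayMs(maxSessionSeconds));
}
```

Clear the timer in your close handler so a session that ends early doesn't trigger a stale wrap-up.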

Token expiry and failure modes

If a token is missing, expired, or invalid, the server rejects the handshake with an UNAUTHORIZED error (close code 1008). In browsers, this may surface as a close event with code 1006 and no body; you won't receive a session.error event. Always fetch a fresh token immediately before each connection attempt.

If the WebSocket drops mid-session and you need to reconnect with session.resume, you'll need a new token for the new WebSocket; the original token can't be reused.
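A reconnect helper might look like the following sketch. The session.resume payload shape (in particular the session_id field) is an assumption, not a confirmed part of the API:

```javascript
// Sketch: reconnect after a mid-session drop. The token from the dropped
// connection is spent, so mint a fresh one first. The session.resume
// payload shape here is an assumption.
async function reconnect(sessionId) {
  const { token } = await fetch("/api/voice-token").then((r) => r.json());
  const wsUrl = new URL("wss://agents.assemblyai.com/v1/ws");
  wsUrl.searchParams.set("token", token);
  const ws = new WebSocket(wsUrl);
  ws.addEventListener("open", () => {
    ws.send(JSON.stringify({ type: "session.resume", session_id: sessionId }));
  });
  return ws;
}
```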

2. Connect from the browser with the token

Fetch the token from your server, then open the WebSocket with ?token=<token>. No Authorization header is needed.

// browser/voice-agent.js
const { token } = await fetch("/api/voice-token").then((r) => r.json());

const wsUrl = new URL("wss://agents.assemblyai.com/v1/ws");
wsUrl.searchParams.set("token", token);
const ws = new WebSocket(wsUrl);

ws.addEventListener("open", () => {
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        system_prompt: "You are a helpful voice assistant.",
        greeting: "Hi there! How can I help you today?",
        output: { voice: "ivy" },
      },
    }),
  );
});

ws.addEventListener("message", (event) => {
  const message = JSON.parse(event.data);
  // Handle session.ready, reply.audio, transcript.*, tool.call, etc.
  console.log(message);
});

Fetch a fresh token for every new WebSocket connection. Tokens are single-use: a dropped connection needs a new token to reconnect (including when using session.resume).

3. Browser quickstart

A complete working example that captures microphone audio, streams it to the Voice Agent API, and plays back the agent's response. It requires two files: an HTML page and an AudioWorklet processor.

AudioWorklet processors must be loaded from a URL (audioContext.audioWorklet.addModule(url)), so you need at least two files. This example won’t work in a single-file environment like CodePen or JSFiddle without modifications. Use a local server (npx serve .) or a framework with static file support.

Create pcm-processor.js in the same directory as your HTML file:

// pcm-processor.js - AudioWorklet that captures PCM16 from the mic
class PCMProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const input = inputs[0]?.[0];
    if (input) {
      // Convert Float32 [-1, 1] to Int16
      const pcm16 = new Int16Array(input.length);
      for (let i = 0; i < input.length; i++) {
        pcm16[i] = Math.max(-32768, Math.min(32767, Math.round(input[i] * 32767)));
      }
      this.port.postMessage(pcm16.buffer, [pcm16.buffer]);
    }
    return true;
  }
}

registerProcessor("pcm-processor", PCMProcessor);

Then create your HTML file:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Voice Agent</title>
</head>
<body>
  <button id="start">Start conversation</button>
  <pre id="log"></pre>
  <script>
    const log = (msg) => { document.getElementById("log").textContent += msg + "\n"; };

    document.getElementById("start").addEventListener("click", async () => {
      // 1. Get token from your server (see step 1 above)
      const { token } = await fetch("/api/voice-token").then((r) => r.json());

      // 2. Force AudioContext to 24 kHz - avoids manual resampling on both
      // capture and playback in Chromium and Firefox. Safari ignores this
      // option (see Browser compatibility below) and runs at the hardware
      // rate, so production code should resample inside the worklet.
      const audioCtx = new AudioContext({ sampleRate: 24000 });
      await audioCtx.audioWorklet.addModule("pcm-processor.js");

      // 3. Capture mic audio with echo cancellation enabled
      const stream = await navigator.mediaDevices.getUserMedia({
        audio: { echoCancellation: true, sampleRate: 24000 },
      });
      const source = audioCtx.createMediaStreamSource(stream);
      const worklet = new AudioWorkletNode(audioCtx, "pcm-processor");

      // 4. Connect WebSocket
      const wsUrl = new URL("wss://agents.assemblyai.com/v1/ws");
      wsUrl.searchParams.set("token", token);
      const ws = new WebSocket(wsUrl);

      let ready = false;
      let playbackTime = audioCtx.currentTime;

      // Send mic audio to the server once the session is ready
      worklet.port.onmessage = (e) => {
        if (ready && ws.readyState === WebSocket.OPEN) {
          const b64 = btoa(String.fromCharCode(...new Uint8Array(e.data)));
          ws.send(JSON.stringify({ type: "input.audio", audio: b64 }));
        }
      };
      source.connect(worklet).connect(audioCtx.destination);

      ws.addEventListener("open", () => {
        ws.send(JSON.stringify({
          type: "session.update",
          session: {
            system_prompt: "You are a helpful voice assistant. Keep responses concise.",
            greeting: "Hi! How can I help you?",
            output: { voice: "ivy" },
          },
        }));
      });

      ws.addEventListener("message", (event) => {
        const msg = JSON.parse(event.data);

        if (msg.type === "session.ready") {
          ready = true;
          log("Session ready, start speaking");
        } else if (msg.type === "reply.audio") {
          // Decode base64 PCM16 and schedule playback
          const raw = atob(msg.data);
          const pcm16 = new Int16Array(raw.length / 2);
          for (let i = 0; i < pcm16.length; i++) {
            pcm16[i] = raw.charCodeAt(i * 2) | (raw.charCodeAt(i * 2 + 1) << 8);
          }
          const float32 = new Float32Array(pcm16.length);
          for (let i = 0; i < pcm16.length; i++) {
            float32[i] = pcm16[i] / 32768;
          }
          const buffer = audioCtx.createBuffer(1, float32.length, 24000);
          buffer.getChannelData(0).set(float32);
          const src = audioCtx.createBufferSource();
          src.buffer = buffer;
          src.connect(audioCtx.destination);
          const now = audioCtx.currentTime;
          playbackTime = Math.max(playbackTime, now);
          src.start(playbackTime);
          playbackTime += buffer.duration;
        } else if (msg.type === "reply.done" && msg.status === "interrupted") {
          // Reset playback schedule to avoid stale audio
          playbackTime = audioCtx.currentTime;
        } else if (msg.type === "transcript.user") {
          log("You: " + msg.text);
        } else if (msg.type === "transcript.agent") {
          log("Agent: " + msg.text);
        } else if (msg.type === "session.error" || msg.type === "error") {
          log("Error: " + msg.message);
        }
      });

      ws.addEventListener("close", () => log("Connection closed"));
    });
  </script>
</body>
</html>

The key line is new AudioContext({ sampleRate: 24000 }). Most browsers default to the device sample rate (usually 48 kHz), so without this you’d need to manually resample both mic input and playback output. Forcing 24 kHz on the context avoids this entirely. Safari ignores this option and runs at the hardware rate — see Browser compatibility for a Safari-safe pipeline.

4. Browser compatibility

The quickstart above works as-is on Chromium-based browsers (Chrome, Edge, Brave, Arc) and Firefox. Safari has a known quirk that produces silently garbled audio if you don’t account for it.

  • Chrome / Edge: honors AudioContext({ sampleRate }). Use the quickstart as-is.
  • Firefox: honors AudioContext({ sampleRate }). Use the quickstart as-is.
  • Safari (desktop and iOS): ignores AudioContext({ sampleRate }) and runs at the hardware rate (typically 48 kHz). Let the AudioContext use its default rate and resample to/from 24 kHz inside the worklet (capture) and before playback.

Safari: resample inside the worklet

Safari ignores the sampleRate constructor option, so new AudioContext({ sampleRate: 24000 }) will silently run at 48 kHz on most Macs. Sending those samples to the Voice Agent API as if they were 24 kHz produces audio that sounds chipmunked or garbled.

Detect the actual context rate at runtime, send it into the worklet, and resample there:

// browser/voice-agent.js — Safari-safe context
const audioCtx = new AudioContext(); // let Safari pick its hardware rate
await audioCtx.audioWorklet.addModule("pcm-processor.js");
const worklet = new AudioWorkletNode(audioCtx, "pcm-processor", {
  processorOptions: { inputSampleRate: audioCtx.sampleRate, targetSampleRate: 24000 },
});
// pcm-processor.js — linear resample to 24 kHz before posting PCM16
class PCMProcessor extends AudioWorkletProcessor {
  constructor(options) {
    super();
    const { inputSampleRate, targetSampleRate } = options.processorOptions;
    this.ratio = inputSampleRate / targetSampleRate;
  }
  process(inputs) {
    const input = inputs[0]?.[0];
    if (!input) return true;
    const outLength = Math.floor(input.length / this.ratio);
    const pcm16 = new Int16Array(outLength);
    for (let i = 0; i < outLength; i++) {
      const sample = input[Math.floor(i * this.ratio)] ?? 0;
      pcm16[i] = Math.max(-32768, Math.min(32767, Math.round(sample * 32767)));
    }
    this.port.postMessage(pcm16.buffer, [pcm16.buffer]);
    return true;
  }
}
registerProcessor("pcm-processor", PCMProcessor);

For playback, build the AudioBuffer at 24 kHz and let the context resample on output, or resample the decoded PCM16 to audioCtx.sampleRate before scheduling — the simplest version (createBuffer(1, length, 24000)) works on all current browsers.

Linear interpolation is good enough for speech at 24 kHz. If you want higher fidelity, use a windowed-sinc resampler such as libsamplerate compiled to WASM, or push the PCM16 through an OfflineAudioContext at the target rate.
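Note that the worklet above picks the nearest earlier sample rather than interpolating; true linear interpolation between adjacent samples costs only a few extra lines and is audibly smoother. A pure-function sketch (the function name is illustrative):

```javascript
// Linear-interpolation resampler for mono Float32 audio (sketch).
function resampleLinear(input, inputRate, targetRate) {
  const ratio = inputRate / targetRate;
  const outLength = Math.floor(input.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;          // fractional position in the input
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}
```

The same function works in either direction (48 kHz capture down to 24 kHz, or 24 kHz replies up to the context rate).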

Cross-browser checklist

  • User gesture required. All major browsers gate getUserMedia and AudioContext startup behind a user gesture (Safari is strictest). Start audio inside a click or touchstart handler and call await audioCtx.resume() before connecting nodes.
  • HTTPS or localhost. getUserMedia only works on secure origins.
  • Echo cancellation. Pass echoCancellation: true to getUserMedia so the agent’s TTS playing through the speakers doesn’t get re-captured by the mic.
  • Audio output sink. On iOS Safari, set the <audio playsinline> attribute or route through an AudioContext destination — autoplay and full-screen behavior differ from desktop.
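The first two gotchas above can be handled with one gesture-gated startup helper, roughly like this sketch (all names are illustrative; run it from a page served over HTTPS or localhost):

```javascript
// Sketch: gate all audio startup behind a single user gesture, as
// required by every major browser (Safari is strictest).
function startOnGesture(button, run) {
  button.addEventListener(
    "click",
    async () => {
      const audioCtx = new AudioContext();
      await audioCtx.resume(); // must be called inside the gesture handler
      const stream = await navigator.mediaDevices.getUserMedia({
        audio: { echoCancellation: true },
      });
      await run(audioCtx, stream); // connect worklet + WebSocket here
    },
    { once: true },
  );
}
```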