Browser integration
Connect browser-based apps to the Voice Agent API using a temporary token.
Connect a browser to the Voice Agent API in two steps:
- Your server calls GET /v1/token with your API key to mint a short-lived temporary token.
- Your browser opens the WebSocket with ?token=<token>, so no API key is exposed to the client.
Your API key never leaves your server. Each token is single-use: it starts exactly one session, and all usage is attributed to the key that generated it.
Browsers provide built-in acoustic echo cancellation through getUserMedia, so browser-based clients work hands-free without headphones. If you’re developing on a laptop, the browser integration is the recommended starting point.
1. Generate a token on your server
Call GET /v1/token with your API key in the Authorization header. Pick an expires_in_seconds short enough to limit replay risk (60–300s is a good default) and an optional max_session_duration_seconds to cap the session length.
These two parameters control different things and are easy to confuse:
- expires_in_seconds is the token redemption window: how long the client has to use this token to open a WebSocket. If the window elapses before the WebSocket is opened, the server returns a session.error with code unauthorized on the first frame instead of session.ready. Once a session.ready has been received, this value no longer applies.
- max_session_duration_seconds is the session duration cap: how long the resulting voice agent session is allowed to run after the WebSocket is open.
expires_in_seconds must be between 1 and 600. max_session_duration_seconds must be between 60 and 10800 (defaults to 10800, the 3-hour maximum session duration).
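As a concrete starting point, here is a minimal server-side sketch of the mint step. It assumes a Node 18+ runtime with global fetch, a Bearer-style Authorization header, and that the two parameters are passed as query parameters; the base URL is a placeholder for the real API host.

```javascript
const API_BASE = "https://api.example.com"; // placeholder for the real API host

// Build the GET /v1/token URL with both duration parameters.
function buildTokenUrl(expiresInSeconds, maxSessionDurationSeconds) {
  const url = new URL("/v1/token", API_BASE);
  url.searchParams.set("expires_in_seconds", String(expiresInSeconds));
  url.searchParams.set("max_session_duration_seconds", String(maxSessionDurationSeconds));
  return url;
}

async function mintToken(apiKey) {
  // 120 s redemption window, 30-minute session cap.
  const res = await fetch(buildTokenUrl(120, 1800), {
    headers: { Authorization: `Bearer ${apiKey}` },
  });
  if (!res.ok) throw new Error(`token mint failed: ${res.status}`);
  return res.json(); // assumed to contain the temporary token
}
```

Your server would expose the result of mintToken to the browser through its own authenticated route, never the raw API key.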
Session end at max_session_duration_seconds
When the session reaches its server-side duration limit, the WebSocket closes. There is no separate “closing soon” warning event before this — if you need to finalize gracefully (e.g. play a wrap-up message, save state), run a client-side timer using the value you passed for max_session_duration_seconds and start your wrap-up a few seconds before it elapses.
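A minimal sketch of that client-side timer, assuming you reuse the same max_session_duration_seconds value you requested when minting the token; wrapUpDelayMs and scheduleWrapUp are illustrative helper names.

```javascript
// How long to wait before starting the wrap-up, leaving a safety
// margin before the server closes the WebSocket.
function wrapUpDelayMs(maxSessionDurationSeconds, marginSeconds = 10) {
  return Math.max(0, (maxSessionDurationSeconds - marginSeconds) * 1000);
}

function scheduleWrapUp(maxSessionDurationSeconds, onWrapUp) {
  // Clear with clearTimeout if the user hangs up normally first.
  return setTimeout(onWrapUp, wrapUpDelayMs(maxSessionDurationSeconds));
}
```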
Token expiry and failure modes
If a token is missing, expired, or invalid, the server rejects the handshake with an UNAUTHORIZED error (close code 1008). In browsers, this may surface as a close event with code 1006 and no body; you won't receive a session.error event. Always fetch a fresh token immediately before each connection attempt.
If the WebSocket drops mid-session and you need to reconnect with session.resume, you'll need a new token for the new WebSocket; the original token can't be reused.
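A reconnect sketch under those rules. /api/voice-token is a hypothetical route on your own server that mints and returns { token }; the WebSocket base URL is a placeholder, and the session.resume message shape shown here is illustrative; check the session.resume reference for the exact fields.

```javascript
const AGENT_WS_BASE = "wss://api.example.com/v1/agent"; // placeholder endpoint

function tokenUrl(base, token) {
  const url = new URL(base);
  url.searchParams.set("token", token); // token goes in the query string
  return url.toString();
}

async function reconnect(previousSessionId) {
  // The original token was spent opening the first WebSocket: mint a new one.
  const { token } = await (await fetch("/api/voice-token")).json();
  const ws = new WebSocket(tokenUrl(AGENT_WS_BASE, token));
  ws.addEventListener("open", () => {
    // Illustrative resume payload; field names are not confirmed by this page.
    ws.send(JSON.stringify({ type: "session.resume", session_id: previousSessionId }));
  });
  return ws;
}
```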
2. Connect from the browser with the token
Fetch the token from your server, then open the WebSocket with ?token=<token>. No Authorization header is needed.
Fetch a fresh token for every new WebSocket connection. Tokens are single-use; a dropped connection needs a new token to reconnect (including when using session.resume).
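A browser-side connect sketch. /api/voice-token is a hypothetical route on your own server that calls GET /v1/token and returns { token }; the WebSocket base URL below is likewise a placeholder for the real Voice Agent endpoint.

```javascript
const AGENT_WS_BASE = "wss://api.example.com/v1/agent"; // placeholder endpoint

function wsUrlWithToken(base, token) {
  const url = new URL(base);
  url.searchParams.set("token", token); // ?token=<token>, no Authorization header
  return url.toString();
}

async function connectVoiceAgent() {
  // Fetch a fresh single-use token immediately before each connection attempt.
  const { token } = await (await fetch("/api/voice-token")).json();
  return new WebSocket(wsUrlWithToken(AGENT_WS_BASE, token));
}
```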
3. Browser quickstart
A complete working example that captures microphone audio, streams it to the Voice Agent API, and plays back the agent's response. This requires two files: an HTML page and an AudioWorklet processor.
AudioWorklet processors must be loaded from a URL (audioContext.audioWorklet.addModule(url)), so you need at least two files. This example won’t work in a single-file environment like CodePen or JSFiddle without modifications. Use a local server (npx serve .) or a framework with static file support.
Create pcm-processor.js in the same directory as your HTML file:
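A minimal sketch of such a processor: it converts the Float32 samples the browser hands it into 16-bit PCM and posts each chunk to the main thread, which forwards it over the WebSocket. Names and chunking are illustrative; the guard around registerProcessor simply skips registration outside the AudioWorkletGlobalScope.

```javascript
// pcm-processor.js: capture worklet sketch.

// Convert [-1, 1] float samples to signed 16-bit PCM.
function floatTo16BitPCM(float32) {
  const out = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    const s = Math.max(-1, Math.min(1, float32[i])); // clamp to avoid overflow
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}

// registerProcessor / AudioWorkletProcessor only exist in the worklet scope.
if (typeof AudioWorkletProcessor !== "undefined") {
  class PCMProcessor extends AudioWorkletProcessor {
    process(inputs) {
      const channel = inputs[0]?.[0]; // first input, first channel
      if (channel) {
        const pcm = floatTo16BitPCM(channel);
        // Transfer the underlying buffer to the main thread without copying.
        this.port.postMessage(pcm.buffer, [pcm.buffer]);
      }
      return true; // keep the processor alive
    }
  }
  registerProcessor("pcm-processor", PCMProcessor);
}
```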
Then create your HTML file:
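A sketch of the page side, assuming the hypothetical token route and WebSocket endpoint used elsewhere on this page, and assuming the API exchanges raw PCM16 as binary frames and events like session.ready as JSON text frames. Serve it next to pcm-processor.js.

```html
<!doctype html>
<html>
<body>
  <button id="start">Start</button>
  <script type="module">
    const AGENT_WS_BASE = "wss://api.example.com/v1/agent"; // placeholder endpoint

    document.getElementById("start").onclick = async () => {
      // 24 kHz matches the API's PCM16 format (Safari ignores this option;
      // see Browser compatibility).
      const audioCtx = new AudioContext({ sampleRate: 24000 });
      await audioCtx.resume(); // inside the click handler: satisfies autoplay rules
      await audioCtx.audioWorklet.addModule("pcm-processor.js");

      const media = await navigator.mediaDevices.getUserMedia({
        audio: { echoCancellation: true },
      });
      const source = audioCtx.createMediaStreamSource(media);
      const worklet = new AudioWorkletNode(audioCtx, "pcm-processor");
      source.connect(worklet);

      // /api/voice-token is a hypothetical route on your own server.
      const { token } = await (await fetch("/api/voice-token")).json();
      const ws = new WebSocket(`${AGENT_WS_BASE}?token=${token}`);
      ws.binaryType = "arraybuffer";
      worklet.port.onmessage = (e) => {
        if (ws.readyState === WebSocket.OPEN) ws.send(e.data); // raw PCM16 chunks
      };

      let playhead = audioCtx.currentTime;
      ws.onmessage = (e) => {
        if (typeof e.data === "string") return; // JSON events (session.ready, …)
        // Decode PCM16 and schedule gapless playback at 24 kHz.
        const pcm = new Int16Array(e.data);
        const buf = audioCtx.createBuffer(1, pcm.length, 24000);
        const ch = buf.getChannelData(0);
        for (let i = 0; i < pcm.length; i++) ch[i] = pcm[i] / 32768;
        const node = audioCtx.createBufferSource();
        node.buffer = buf;
        node.connect(audioCtx.destination);
        playhead = Math.max(playhead, audioCtx.currentTime);
        node.start(playhead);
        playhead += buf.duration;
      };
    };
  </script>
</body>
</html>
```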
The key line is new AudioContext({ sampleRate: 24000 }). Most browsers default to the device sample rate (usually 48 kHz), so without this you’d need to manually resample both mic input and playback output. Forcing 24 kHz on the context avoids this entirely. Safari ignores this option and runs at the hardware rate — see Browser compatibility for a Safari-safe pipeline.
4. Browser compatibility
The quickstart above works as-is on Chromium-based browsers (Chrome, Edge, Brave, Arc) and Firefox. Safari has a known quirk that produces silently garbled audio if you don’t account for it.
Safari: resample inside the worklet
Safari ignores the sampleRate constructor option, so an AudioContext({ sampleRate: 24000 }) will silently run at 48 kHz on most Macs. Sending those samples to the Voice Agent API as if they were 24 kHz produces audio that sounds chipmunked or garbled.
Detect the actual context rate at runtime, send it into the worklet, and resample there:
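A sketch of both halves, assuming the processor is registered as "pcm-processor" and reads the rates from processorOptions; makeCaptureNode and resampleLinear are illustrative names.

```javascript
// Main-thread side: hand the worklet the true hardware rate via processorOptions.
function makeCaptureNode(audioCtx) {
  return new AudioWorkletNode(audioCtx, "pcm-processor", {
    processorOptions: { inputRate: audioCtx.sampleRate, targetRate: 24000 },
  });
}

// Worklet side: linear-interpolation resampler, adequate for speech.
function resampleLinear(input, fromRate, toRate) {
  if (fromRate === toRate) return input;
  const ratio = fromRate / toRate;
  const outLength = Math.floor(input.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac; // interpolate neighbors
  }
  return out;
}
```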
For playback, build the AudioBuffer at 24 kHz and let the context resample on output, or resample the decoded PCM16 to audioCtx.sampleRate before scheduling — the simplest version (createBuffer(1, length, 24000)) works on all current browsers.
Linear interpolation is good enough for speech at 24 kHz. If you want higher fidelity, use a windowed-sinc resampler such as libsamplerate compiled to WASM, or push the PCM16 through an OfflineAudioContext at the target rate.
Cross-browser checklist
- User gesture required. All major browsers gate getUserMedia and AudioContext startup behind a user gesture (Safari is strictest). Start audio inside a click or touchstart handler and call await audioCtx.resume() before connecting nodes.
- HTTPS or localhost. getUserMedia only works on secure origins.
- Echo cancellation. Pass echoCancellation: true to getUserMedia so the agent's TTS playing through the speakers doesn't get re-captured by the mic.
- Audio output sink. On iOS Safari, set the <audio playsinline> attribute or route through an AudioContext destination; autoplay and full-screen behavior differ from desktop.