audio/pcmu encoding, the server forwards audio as-is with zero transcoding.
Before you begin
To complete this guide, you need:
- An AssemblyAI API key with Voice Agent access.
- A Twilio account with a phone number. You can buy one if you don’t have one.
- Node.js 20+.
- ngrok for exposing your local server to the internet.
Quickstart
Clone the example repo and get a working Twilio voice agent in minutes.
Start an ngrok tunnel
Twilio needs a public URL to reach your local server. In a separate terminal, start ngrok, then copy the https://...ngrok.app URL from the output.
Point Twilio at your server
In the Twilio Console, open your phone number’s Voice configuration and set:
- A call comes in → Webhook → POST → https://<your-ngrok-domain>/twiml
- Call status changes (optional) → Webhook → POST → https://<your-ngrok-domain>/call-status
How it works
When a call comes in, the following sequence happens:
- A caller dials your Twilio number.
- Twilio sends a webhook to POST /twiml on your server. The server returns TwiML containing a <Stream> element pointed at your WebSocket endpoint.
- Twilio opens a Media Streams WebSocket and starts sending the caller’s audio (G.711 μ-law, 8 kHz).
- Your server opens a parallel WebSocket to the Voice Agent API and sends a session.update with the system prompt, voice, greeting, tools, and audio format set to audio/pcmu.
- Once session.ready fires, the server forwards audio in both directions:
  - Caller → Agent: Each Twilio media event becomes an input.audio event.
  - Agent → Caller: Each reply.audio event becomes a Twilio media action.
- When the caller barges in (input.speech.started), the server sends a Twilio clear action so the agent stops talking immediately.
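The session.update message described in the sequence above might look like the following sketch. Only the event name session.update and the audio/pcmu encoding come from this guide. The field layout (session, system_prompt, input.format.encoding, and so on) and the helper name buildSessionUpdate are assumptions for illustration, so check the Voice Agent API reference for the exact schema.

```typescript
// Hypothetical container for the settings this guide says the session needs.
interface AgentConfig {
  systemPrompt: string;
  voice: string;
  greeting: string;
  tools: unknown[];
}

// G.711 mu-law at 8 kHz, the format Twilio Media Streams sends and expects.
const TWILIO_ENCODING = "audio/pcmu";

// Serialize a session.update message.
// ASSUMPTION: every field name below except `type` is illustrative.
function buildSessionUpdate(config: AgentConfig): string {
  return JSON.stringify({
    type: "session.update",
    session: {
      system_prompt: config.systemPrompt,
      greeting: config.greeting,
      tools: config.tools,
      // Same encoding in both directions, so no transcoding is needed.
      input: { format: { encoding: TWILIO_ENCODING } },
      output: { voice: config.voice, format: { encoding: TWILIO_ENCODING } },
    },
  });
}
```

You would send the returned string over the Voice Agent API WebSocket as soon as it opens, then wait for session.ready before forwarding audio.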
Return TwiML with a stream
When Twilio receives a call, it hits your /twiml endpoint. The server responds with TwiML that opens a Media Streams WebSocket.
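The example repo's handler is not reproduced here. As a minimal sketch, the TwiML response might be built like this. The <Connect> and <Stream> verbs are standard TwiML. The function name buildStreamTwiml and the /stream path are assumptions for illustration, so substitute the WebSocket path your server actually exposes.

```typescript
// Escape characters that are not allowed inside an XML attribute value.
function escapeXmlAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build TwiML that tells Twilio to open a bidirectional Media Streams
// WebSocket to wss://<hostname><streamPath>.
// ASSUMPTION: "/stream" is a hypothetical default, not taken from the example repo.
function buildStreamTwiml(hostname: string, streamPath: string = "/stream"): string {
  const url = escapeXmlAttr(`wss://${hostname}${streamPath}`);
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <Stream url="${url}" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}
```

Return the resulting string from the POST /twiml handler with a Content-Type of text/xml, passing your ngrok domain as the hostname.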
Connect to the Voice Agent API
When Twilio opens the Media Streams WebSocket, the server creates a parallel connection to the Voice Agent API and sends the session configuration.
Both input and output use audio/pcmu (G.711 μ-law at 8 kHz) to match Twilio’s native codec. This means no transcoding or resampling is needed.
Bridge audio between Twilio and the Voice Agent API
Once session.ready fires, forward audio payloads in both directions.
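A minimal sketch of the two translations follows. The Twilio message shapes (event, streamSid, media.payload) follow Twilio's Media Streams protocol. On the Voice Agent API side, only the event names come from this guide. The payload field names audio and data are assumptions, so confirm them against the API reference. Because both sides carry base64-encoded μ-law, the payload string passes through unchanged. The sketch also covers input.speech.started, which maps to a Twilio clear message.

```typescript
// Inbound Twilio Media Streams message (only the fields used here).
interface TwilioMediaMessage {
  event: "media";
  streamSid: string;
  media: { payload: string }; // base64-encoded mu-law audio
}

// Voice Agent API event as received by the server.
// ASSUMPTION: the `data` field name is illustrative.
interface AgentEvent {
  type: string;
  data?: string;
}

// Caller -> Agent: wrap a Twilio media payload in an input.audio event.
// ASSUMPTION: the `audio` field name is illustrative.
function twilioMediaToAgent(msg: TwilioMediaMessage): string {
  return JSON.stringify({ type: "input.audio", audio: msg.media.payload });
}

// Agent -> Caller: translate an agent event into a Twilio message,
// or return null when the event needs no Twilio action.
function agentEventToTwilio(event: AgentEvent, streamSid: string): string | null {
  if (event.type === "reply.audio" && typeof event.data === "string") {
    return JSON.stringify({ event: "media", streamSid, media: { payload: event.data } });
  }
  if (event.type === "input.speech.started") {
    // Barge-in: drop any audio Twilio has buffered but not yet played.
    return JSON.stringify({ event: "clear", streamSid });
  }
  return null;
}
```

Keeping the translation in pure functions like these makes the bridge easy to test without opening either WebSocket.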
Handle barge-in
When the caller starts speaking while the agent is talking, clear the Twilio audio buffer so the agent stops immediately.
Make outbound calls
The example repo also supports outbound calling. Set the Twilio credentials in .env:
/outbound-twiml, which connects the call to /outbound-stream. The agent speaks first using the configured greeting.
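For context, placing an outbound call with Twilio typically means a POST to the Calls resource of Twilio's REST API with To, From, and Url parameters, authenticated over HTTP Basic auth with your Account SID and Auth Token. The sketch below only builds that request. The function and parameter names are hypothetical, and the example repo may use Twilio's SDK or its own trigger mechanism instead.

```typescript
// Pieces of the HTTP request needed to start a call.
interface OutboundCallRequest {
  url: string; // Twilio REST endpoint for creating calls
  body: string; // application/x-www-form-urlencoded payload
}

// Build the request that asks Twilio to dial `to` and then fetch TwiML
// from https://<hostname>/outbound-twiml once the call is answered.
// ASSUMPTION: argument names are illustrative; credentials are supplied
// separately as HTTP Basic auth (Account SID as user, Auth Token as password).
function buildOutboundCallRequest(
  accountSid: string,
  from: string,
  to: string,
  hostname: string
): OutboundCallRequest {
  const params: Array<[string, string]> = [
    ["To", to],
    ["From", from],
    ["Url", `https://${hostname}/outbound-twiml`],
  ];
  const body = params
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  return {
    url: `https://api.twilio.com/2010-04-01/Accounts/${accountSid}/Calls.json`,
    body,
  };
}
```

Phone numbers should be in E.164 format, and the leading + is percent-encoded in the form body.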
Add custom tools
The example includes one tool, generate_random_number. To add your own tools:
- Define the tool in the TOOLS array in src/bot.ts.
- Add the handler in runTool.
- When the agent calls a tool, the Voice Agent API sends a tool.call event. The server runs the tool and sends back a tool.result event with the same call_id. The agent then continues the conversation naturally.
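The steps above can be sketched as follows. The tool name generate_random_number, the TOOLS array, runTool, and the tool.call / tool.result events with a shared call_id come from this guide. The tool definition fields, the args and result field names, the min and max parameters, and the handleToolCall helper are assumptions for illustration.

```typescript
// ASSUMPTION: a JSON-Schema-style tool definition; the real schema may differ.
interface ToolDefinition {
  name: string;
  description: string;
  parameters: object;
}

const TOOLS: ToolDefinition[] = [
  {
    name: "generate_random_number",
    description: "Return a random integer between min and max, inclusive.",
    parameters: {
      type: "object",
      properties: { min: { type: "integer" }, max: { type: "integer" } },
      required: ["min", "max"],
    },
  },
];

// ASSUMPTION: `name` and `args` field names are illustrative.
interface ToolCallEvent {
  type: "tool.call";
  call_id: string;
  name: string;
  args: Record<string, unknown>;
}

// Dispatch a tool call by name and return its result.
function runTool(name: string, args: Record<string, unknown>): unknown {
  switch (name) {
    case "generate_random_number": {
      const min = Math.ceil(Number(args["min"] ?? 1));
      const max = Math.floor(Number(args["max"] ?? 100));
      return { value: Math.floor(Math.random() * (max - min + 1)) + min };
    }
    default:
      throw new Error(`Unknown tool: ${name}`);
  }
}

// Run the tool and build the tool.result reply, echoing the same call_id.
function handleToolCall(event: ToolCallEvent): string {
  const result = runTool(event.name, event.args);
  return JSON.stringify({ type: "tool.result", call_id: event.call_id, result });
}
```

Adding a tool then means one new entry in TOOLS and one new case in runTool.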
Troubleshooting
- Call connects but no audio: Check that HOSTNAME matches your ngrok domain and that your server is reachable. Watch ngrok’s request log for the incoming Media Streams WebSocket.
- session.error with invalid_value on the voice field: Voice names are case-sensitive. Use lowercase (ivy, claire, dawn, etc.). See Choose a voice for available voices.
- Greeting plays but later replies don’t: Make sure your tool handler always sends a tool.result back. The agent waits for it before continuing.
- Audio is choppy or echoey: Twilio handles echo cancellation on the carrier side. If you hear echo during testing, it’s likely your speakerphone. Use a headset.
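For the case above where later replies never play, one way to guarantee a tool.result is to wrap the handler so that a thrown error is reported back to the agent instead of leaving it waiting. This is a sketch only: safeToolResult is a hypothetical helper, and the result and error field names are assumptions.

```typescript
// A tool handler takes parsed arguments and returns any JSON-serializable value.
// Synchronous for brevity; an async handler needs the same try/catch around an awaited call.
type ToolHandler = (args: Record<string, unknown>) => unknown;

// Always produce a tool.result message for the given call_id,
// even when the handler throws or returns nothing.
function safeToolResult(
  callId: string,
  handler: ToolHandler,
  args: Record<string, unknown>
): string {
  let result: unknown;
  try {
    // `?? null` keeps the result key present when the handler returns undefined.
    result = handler(args) ?? null;
  } catch (err) {
    // Report the failure so the agent can respond instead of stalling.
    result = { error: err instanceof Error ? err.message : String(err) };
  }
  return JSON.stringify({ type: "tool.result", call_id: callId, result });
}
```

With this wrapper, a failing tool still unblocks the conversation, and the agent can tell the caller that something went wrong.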
Next steps
- Configure your agent: Customize the system prompt, greeting, and turn detection.
- Choose a voice: Pick a voice for your agent.
- Add tools to your agent: Give your agent the ability to call functions.
- Audio format: Learn more about supported encodings.
- Twilio Media Streams: Twilio’s documentation on Media Streams.