Connect to Twilio
Connect Twilio Programmable Voice to the Voice Agent API so callers can have real-time conversations with your agent over the phone. Twilio handles the phone network, your server bridges audio between Twilio Media Streams and the Voice Agent API, and the agent handles speech-to-speech.
Because Twilio’s native G.711 μ-law format is byte-compatible with the Voice Agent API’s audio/pcmu encoding, the server forwards audio as-is with zero transcoding.
Before you begin
To complete this guide, you need:
- An AssemblyAI API key with Voice Agent access.
- A Twilio account with a phone number. You can buy one if you don’t have one.
- Node.js 20+.
- ngrok for exposing your local server to the internet.
Quickstart
Clone the example repo and get a working Twilio voice agent in minutes.
Start an ngrok tunnel
Twilio needs a public URL to reach your local server. In a separate terminal, start ngrok:
Copy the https://...ngrok.app URL from the output.
Point Twilio at your server
In the Twilio Console, open your phone number’s Voice configuration and set:
- A call comes in → Webhook → POST → https://<your-ngrok-domain>/twiml
- Call status changes (optional) → Webhook → POST → https://<your-ngrok-domain>/call-status
How it works
When a call comes in, the following sequence happens:
- A caller dials your Twilio number.
- Twilio sends a webhook to POST /twiml on your server. The server returns TwiML containing a <Stream> element pointed at your WebSocket endpoint.
- Twilio opens a Media Streams WebSocket and starts sending the caller’s audio (G.711 μ-law, 8 kHz).
- Your server opens a parallel WebSocket to the Voice Agent API and sends a session.update with the system prompt, voice, greeting, tools, and audio format set to audio/pcmu.
- Once session.ready fires, the server forwards audio in both directions:
  - Caller → Agent: Each Twilio media event becomes an input.audio event.
  - Agent → Caller: Each reply.audio event becomes a Twilio media action.
- When the caller barges in (input.speech.started), the server sends a Twilio clear action so the agent stops talking immediately.
Return TwiML with a stream
When Twilio receives a call, it hits your /twiml endpoint. The server responds with TwiML that opens a Media Streams WebSocket:
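A minimal sketch of such a response, assuming the Media Streams WebSocket is served at a hypothetical /twilio-stream path (the helper name and path are illustrative, not taken from the example repo). It uses Twilio's <Connect><Stream> verb, which opens a bidirectional stream:

```typescript
// Hypothetical helper: builds the TwiML body returned by POST /twiml.
// ASSUMPTION: "/twilio-stream" is an illustrative WebSocket path.
function buildStreamTwiml(hostname: string, path: string = "/twilio-stream"): string {
  // Escape characters that are special inside an XML attribute value.
  const escapeXml = (s: string): string =>
    s
      .replace(/&/g, "&amp;")
      .replace(/</g, "&lt;")
      .replace(/>/g, "&gt;")
      .replace(/"/g, "&quot;");

  // Twilio Media Streams requires a secure WebSocket (wss) URL.
  const url = `wss://${hostname}${path}`;

  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Connect>",
    `    <Stream url="${escapeXml(url)}" />`,
    "  </Connect>",
    "</Response>",
  ].join("\n");
}
```

The route handler would return this string with a Content-Type of text/xml.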
Connect to the Voice Agent API
When Twilio opens the Media Streams WebSocket, the server creates a parallel connection to the Voice Agent API and sends the session configuration:
Both input and output use audio/pcmu (G.711 μ-law at 8 kHz) to match Twilio’s native codec. This means no transcoding or resampling is needed.
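As an illustration, the configuration message might be assembled as below. Only the session.update event name, the list of configured items, and the audio/pcmu encoding come from this guide. The JSON field names and the buildSessionUpdate helper are assumptions, so check them against the Voice Agent API reference before use:

```typescript
// ASSUMPTION: field names below are illustrative, not the confirmed API schema.
interface AgentConfig {
  systemPrompt: string;
  voice: string;
  greeting: string;
  tools: object[];
}

// Hypothetical helper: serializes the first message sent on the agent WebSocket.
function buildSessionUpdate(config: AgentConfig): string {
  return JSON.stringify({
    type: "session.update",
    session: {
      system_prompt: config.systemPrompt,
      voice: config.voice,
      greeting: config.greeting,
      tools: config.tools,
      // Same encoding in both directions, matching Twilio's G.711 mu-law at 8 kHz.
      input: { format: { encoding: "audio/pcmu" } },
      output: { format: { encoding: "audio/pcmu" } },
    },
  });
}
```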
Bridge audio between Twilio and the Voice Agent API
Once session.ready fires, forward audio payloads in both directions:
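The two mappings can be sketched as pure functions over the raw JSON messages. The Twilio side follows the Media Streams message format (base64 audio in media.payload). The agent-side field names audio and data, and both function names, are assumptions made for illustration:

```typescript
// Twilio's inbound "media" message carries base64 mu-law audio in media.payload.
interface TwilioInbound {
  event: string;
  streamSid?: string;
  media?: { payload: string };
}

// Caller -> Agent. ASSUMPTION: the "audio" field name is illustrative.
function twilioToAgent(raw: string): string | null {
  const msg = JSON.parse(raw) as TwilioInbound;
  if (msg.event !== "media" || !msg.media) return null; // ignore start/stop/mark
  return JSON.stringify({ type: "input.audio", audio: msg.media.payload });
}

// Agent -> Caller. ASSUMPTION: the "data" field name is illustrative.
function agentToTwilio(raw: string, streamSid: string): string | null {
  const msg = JSON.parse(raw) as { type: string; data?: string };
  if (msg.type !== "reply.audio" || !msg.data) return null;
  // The base64 payload is passed through untouched: both sides use mu-law.
  return JSON.stringify({ event: "media", streamSid, media: { payload: msg.data } });
}
```

Each function returns null for messages it does not forward, so the caller can skip the send. The streamSid comes from Twilio's start message at the beginning of the stream.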
Handle barge-in
When the caller starts speaking while the agent is talking, clear the Twilio audio buffer so the agent stops immediately:
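A sketch of the barge-in mapping, assuming the agent event arrives as JSON with a type field. The bargeInAction name is hypothetical. The clear message follows Twilio's Media Streams format, which empties any audio Twilio has buffered for playback:

```typescript
// Hypothetical helper: turns an agent event into a Twilio "clear" action when
// the caller starts speaking; returns null for every other event.
function bargeInAction(agentEventRaw: string, streamSid: string): string | null {
  const evt = JSON.parse(agentEventRaw) as { type: string };
  if (evt.type !== "input.speech.started") return null;
  return JSON.stringify({ event: "clear", streamSid });
}
```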
Make outbound calls
The example repo also supports outbound calling. Set the Twilio credentials in .env:
With the server still running, open a new terminal and run:
This places a call from your Twilio number to the target. Twilio fetches /outbound-twiml, which connects the call to /outbound-stream. The agent speaks first using the configured greeting.
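For reference, an outbound call can also be placed directly against Twilio's REST Calls resource. The sketch below only builds the request. The helper name is hypothetical, and the example repo may use a different mechanism, such as Twilio's SDK:

```typescript
interface OutboundCallRequest {
  url: string;
  body: string; // application/x-www-form-urlencoded
}

// Hypothetical helper: builds the request that asks Twilio to dial `to` from
// `from` and then fetch /outbound-twiml from your server for instructions.
function buildOutboundCallRequest(
  accountSid: string,
  from: string,
  to: string,
  hostname: string
): OutboundCallRequest {
  const fields: Array<[string, string]> = [
    ["To", to],
    ["From", from],
    ["Url", `https://${hostname}/outbound-twiml`],
  ];
  const body = fields.map(([k, v]) => `${k}=${encodeURIComponent(v)}`).join("&");
  return {
    url: `https://api.twilio.com/2010-04-01/Accounts/${accountSid}/Calls.json`,
    body,
  };
}
```

Send the result as a POST with a Content-Type of application/x-www-form-urlencoded and HTTP Basic authentication, using your Account SID as the username and your Auth Token as the password.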
Add custom tools
The example includes one tool, generate_random_number. To add your own tools:
- Define the tool in the TOOLS array in src/bot.ts:
- Add the handler in runTool:
- When the agent calls a tool, the Voice Agent API sends a tool.call event. The server runs the tool and sends back a tool.result event with the same call_id. The agent then continues the conversation naturally.
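The three steps above might look like the following sketch. The get_order_status tool, its stand-in data, and the handleToolCall helper are hypothetical. The tool definition schema and the name, arguments, and result fields are assumptions. Only TOOLS, runTool, tool.call, tool.result, and call_id come from this guide:

```typescript
// ASSUMPTION: a JSON-Schema-style tool definition; the real schema may differ.
const TOOLS = [
  {
    name: "get_order_status", // hypothetical tool
    description: "Look up the status of an order by its ID.",
    parameters: {
      type: "object",
      properties: { order_id: { type: "string" } },
      required: ["order_id"],
    },
  },
];

// Stand-in data for the sketch.
const ORDERS: Record<string, string> = { A100: "shipped" };

function runTool(name: string, args: Record<string, unknown>): unknown {
  switch (name) {
    case "get_order_status": {
      const id = String(args["order_id"]);
      return { status: ORDERS[id] ?? "not_found" };
    }
    default:
      throw new Error(`Unknown tool: ${name}`);
  }
}

// Always produces a tool.result, even on failure, so the agent is never left waiting.
function handleToolCall(raw: string): string {
  const call = JSON.parse(raw) as {
    call_id: string;
    name: string;
    arguments?: Record<string, unknown>;
  };
  let result: unknown;
  try {
    result = runTool(call.name, call.arguments ?? {});
  } catch (err) {
    result = { error: err instanceof Error ? err.message : String(err) };
  }
  return JSON.stringify({ type: "tool.result", call_id: call.call_id, result });
}
```

Catching errors inside the handler means a failing tool still yields a tool.result, which avoids the stalled-reply symptom described under Troubleshooting.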
For more on tool calling, see Add tools to your agent.
Troubleshooting
- Call connects but no audio — Check that HOSTNAME matches your ngrok domain and that your server is reachable. Watch ngrok’s request log for the incoming Media Streams WebSocket.
- session.error with invalid_value on the voice field — Voice names are case-sensitive. Use lowercase (ivy, claire, dawn, etc.). See Choose a voice for available voices.
- Greeting plays but later replies don’t — Make sure your tool handler always sends a tool.result back. The agent waits for it before continuing.
- Audio is choppy or echoey — Twilio handles echo cancellation on the carrier side. If you hear echo during testing, it’s likely your speakerphone — use a headset.
Next steps
- Configure your agent — Customize the system prompt, greeting, and turn detection.
- Choose a voice — Pick a voice for your agent.
- Add tools to your agent — Give your agent the ability to call functions.
- Audio format — Learn more about supported encodings.
- Twilio Media Streams — Twilio’s documentation on Media Streams.