In this tutorial, you'll build a server application in Go that transcribes an incoming Twilio phone call into text in real time.
You'll use a Twilio MediaStream to stream the voice data to your local server application. The server passes the voice data to AssemblyAI, which transcribes the call into text, and then prints the text in your terminal as it arrives.
By the end of this tutorial, you'll be able to:
- Record a phone call in real time using a Twilio MediaStream connection.
- Transcribe audio in real time using the AssemblyAI Go SDK.
Before you get started
To complete this tutorial, you'll need:
- An upgraded AssemblyAI account
- Go installed
Step 1: Set up ngrok
Twilio will need to access your server through a publicly available URL. In this tutorial, you'll use ngrok to create a publicly available URL for an application running on your local computer.
1. Sign up for an ngrok account.
2. Install ngrok for your platform.
3. Authenticate your ngrok agent using your authtoken.
       ngrok config add-authtoken <YOUR_TOKEN>
4. Open an ngrok tunnel for port 8080. ngrok only tunnels connections while the following command is running.
       ngrok http 8080
The output includes a Forwarding line. The URL next to Forwarding is the publicly available URL that forwards to your local port 8080 (for example, https://84c5df474.ngrok-free.dev).
Copy the Forwarding URL in your terminal output and save it for the next step.
Step 2: Set up Twilio
You'll need to register a phone number with Twilio and configure it to call your server application whenever someone calls that number. You can also use the Twilio console to update the voice URL for your phone number.
1. Sign up for a Twilio account.
2. Download Twilio CLI.
3. In a new terminal, log in using Twilio CLI. You'll be asked to enter an identifier for your new profile, for example dev.
       twilio login
4. Select the profile you created.
       twilio profiles:use <YOUR_PROFILE_ID>
5. Update the voice URL for your phone number.
       twilio phone-numbers:update <YOUR_TWILIO_NUMBER> --voice-url <YOUR_NGROK_URL>
Now, when someone calls your phone number, Twilio sends a request to your ngrok URL, which forwards it to port 8080 on your local computer. This lets you test changes locally instead of deploying each one to a cloud instance, which speeds up development.
Next up, you'll build the server application to handle the phone call.
Step 3: Create a WebSocket server for Twilio media streams
In this step, you'll set up the Go server application to accept the Twilio MediaStream.
1. Create and navigate into a new project directory.
       mkdir realtime-transcription-go
       cd realtime-transcription-go
2. Initialize your Go module.
       go mod init realtime-transcription-go
3. Install the AssemblyAI Go SDK.
       go get github.com/AssemblyAI/assemblyai-go-sdk
4. Install the WebSocket module by Quinn Rivenwell. You'll need this to handle the incoming Twilio MediaStream connection.
       go get nhooyr.io/websocket
5. Create a new file called main.go with the following content:
       package main

       import (
           "log"
           "net/http"
           "os"
       )

       var apiKey = os.Getenv("ASSEMBLYAI_API_KEY")

       func main() {
           http.HandleFunc("/", twilio)
           http.HandleFunc("/media", media)

           log.Println("Server is running on port 8080")
           if err := http.ListenAndServe(":8080", nil); err != nil {
               log.Fatal(err)
           }
       }

       func twilio(w http.ResponseWriter, r *http.Request) {
           // Respond to Twilio request and initiate a Twilio MediaStream.
       }

       func media(w http.ResponseWriter, r *http.Request) {
           // Serve the incoming WebSocket connection from Twilio.
       }
6. Set the ASSEMBLYAI_API_KEY environment variable. Replace <YOUR_API_KEY> with your AssemblyAI API key.
       export ASSEMBLYAI_API_KEY=<YOUR_API_KEY>
7. To start the server, enter the following in your terminal:
       go run main.go
You now have a running server. Next, you'll implement the twilio and media HTTP handler functions to accept the incoming request from Twilio.
Step 4: Initiate Twilio media stream
When someone calls the phone number, Twilio makes an HTTP request to an endpoint in which you define how you want to respond.
In this step, you'll use TwiML to define the instructions that tell Twilio what to do when you receive an incoming call:
- Instruct the caller on how to use the transcriber.
- Ask Twilio to stream the audio to the Go server using a WebSocket.
To issue your TwiML instructions, add "fmt" to the imports in main.go, then add the following code to your twilio function:
    func twilio(w http.ResponseWriter, r *http.Request) {
        // Twilio calls the voice URL with a POST request.
        if r.Method != http.MethodPost {
            w.WriteHeader(http.StatusMethodNotAllowed)
            return
        }

        twiML := `<?xml version="1.0" encoding="UTF-8"?>
    <Response>
      <Say>
        Speak to see your audio transcribed in the console.
      </Say>
      <Connect>
        <Stream url='%s' />
      </Connect>
    </Response>`

        w.Header().Add("Content-Type", "application/xml")

        // Point the stream at the /media path on the host Twilio called.
        fmt.Fprintf(w, twiML, "wss://"+r.Host+"/media")
    }
When Twilio receives your instructions, it attempts to establish a WebSocket connection to the /media path of your server application.
Step 5: Handle incoming Twilio media stream
In this step, you'll define how to accept the incoming WebSocket connection from Twilio and process incoming messages. Twilio sends four types of messages:
- connected tells you that the media stream is connected.
- start tells you that Twilio started the media stream. After this you'll receive several media messages.
- media contains audio samples for the incoming phone call.
- stop tells you that Twilio stopped the media stream.
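To make the message shapes concrete, here is a small sketch that decodes two hand-written messages with encoding/json. The JSON samples are simplified assumptions (real Twilio messages carry more fields), and decodeMessage is a hypothetical helper. The point it illustrates is that a base64-encoded payload string decodes directly into a []byte field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TwilioMessage keeps only the fields this tutorial uses.
type TwilioMessage struct {
	Event string `json:"event"`
	Media struct {
		// encoding/json decodes a base64 JSON string into a []byte.
		Payload []byte `json:"payload"`
	} `json:"media"`
}

// decodeMessage is a hypothetical helper that parses one JSON message.
func decodeMessage(raw string) (TwilioMessage, error) {
	var m TwilioMessage
	err := json.Unmarshal([]byte(raw), &m)
	return m, err
}

func main() {
	// Simplified, hand-written samples. "/////w==" is base64 for four 0xFF bytes.
	samples := []string{
		`{"event":"start"}`,
		`{"event":"media","media":{"payload":"/////w=="}}`,
	}
	for _, raw := range samples {
		m, err := decodeMessage(raw)
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("%s: %d audio bytes\n", m.Event, len(m.Media.Payload))
	}
}
```

Messages without a media object, such as start, simply leave Payload empty, so a single struct can represent every event type.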
To accept the WebSocket connection and handle the messages:
1. Define a struct to represent each incoming WebSocket message. Messages are JSON encoded, so you'll use wsjson from the websocket module to decode them into structs.
       type TwilioMessage struct {
           Event string `json:"event"`
           Media struct {
               // Contains audio samples. The JSON value is a base64 string,
               // which encoding/json decodes into a []byte.
               Payload []byte `json:"payload"`
           } `json:"media"`
       }
2. Import the context package and the WebSocket modules:
       import (
           // ...
           "context"

           "nhooyr.io/websocket"
           "nhooyr.io/websocket/wsjson"
       )
3. Add the following code to the media HTTP handler function to print each message:
       func media(w http.ResponseWriter, r *http.Request) {
           // Upgrade HTTP request to WebSocket.
           c, err := websocket.Accept(w, r, nil)
           if err != nil {
               // Accept writes an error response itself, so only log here.
               log.Println("unable to upgrade connection to websocket:", err)
               return
           }
           defer c.CloseNow()

           ctx, cancel := context.WithCancel(r.Context())
           defer cancel()

           for {
               var message TwilioMessage

               err = wsjson.Read(ctx, c, &message)
               if err != nil {
                   log.Println("unable to read twilio message:", err)
                   c.Close(websocket.StatusInternalError, err.Error())
                   return
               }

               switch message.Event {
               case "connected":
                   log.Println("twilio mediastream connected")
               case "start":
                   log.Println("twilio mediastream started")
               case "media":
                   // TODO: Send audio to AssemblyAI for transcription.
               case "stop":
                   log.Println("twilio mediastream stopped")
                   c.Close(websocket.StatusNormalClosure, "")
                   return
               }
           }
       }
Next, you'll transcribe the incoming audio samples using AssemblyAI's Real-Time Transcription.
Step 6: Transcribe media stream using AssemblyAI real-time transcription
AssemblyAI Real-Time Transcription lets you transcribe voice data while it's being spoken.
1. Create a transcriber that implements RealTimeHandler to handle real-time messages from AssemblyAI.
       type realtimeTranscriber struct {
       }

       func (t *realtimeTranscriber) SessionBegins(ev aai.SessionBegins) {
           log.Println("session begins")
       }

       func (t *realtimeTranscriber) SessionTerminated(ev aai.SessionTerminated) {
           log.Println("session terminated")
       }

       func (t *realtimeTranscriber) FinalTranscript(transcript aai.FinalTranscript) {
           // Print the final text and move to a new line.
           fmt.Printf("%s\r\n", transcript.Text)
       }

       func (t *realtimeTranscriber) PartialTranscript(transcript aai.PartialTranscript) {
           // Ignore silence.
           if transcript.Text == "" {
               return
           }

           // Return to the start of the line so the next transcript overwrites this one.
           fmt.Printf("%s\r", transcript.Text)
       }

       func (t *realtimeTranscriber) Error(err error) {
           log.Println("something bad happened:", err)
       }
2. Import the AssemblyAI SDK.
       import (
           // ...
           aai "github.com/AssemblyAI/assemblyai-go-sdk"
       )
3. In the media HTTP handler function, create and connect the real-time client just before the for loop.
       handler := new(realtimeTranscriber)

       client := aai.NewRealTimeClientWithOptions(
           aai.WithRealTimeAPIKey(apiKey),

           // Twilio MediaStream sends audio in mu-law format.
           aai.WithRealTimeEncoding(aai.RealTimeEncodingPCMMulaw),

           // Twilio MediaStream sends audio at 8000 samples per second.
           aai.WithRealTimeSampleRate(8000),

           aai.WithHandler(handler),
       )

       if err := client.Connect(ctx); err != nil {
           log.Println("unable to connect to real-time transcription:", err)
           c.Close(websocket.StatusInternalError, err.Error())
           return
       }

       log.Println("connected to real-time transcription")
4. In the switch statement, forward the incoming audio samples to AssemblyAI.
       case "media":
           if err := client.Send(ctx, message.Media.Payload); err != nil {
               log.Println("unable to send audio for real-time transcription:", err)
               c.Close(websocket.StatusInternalError, err.Error())
               return
           }
5. Disconnect the transcriber once the media stream has stopped.
       case "stop":
           log.Println("twilio mediastream stopped")

           if err := client.Disconnect(ctx, true); err != nil {
               log.Println("unable to disconnect from real-time transcription:", err)
               c.Close(websocket.StatusInternalError, err.Error())
               return
           }

           log.Println("disconnected from real-time transcription")

           c.Close(websocket.StatusNormalClosure, "")
           return
Step 7: Test your application
Start the server with go run main.go and call your phone number. If prompted by your operating system, allow the application to access the network.
Once instructed by the voice, start speaking to see your call transcribed in your server log.
session begins
connected to real-time transcription
twilio mediastream connected
twilio mediastream started
Hi. I've arrived at the gate with your food delivery.
twilio mediastream stopped
session terminated
disconnected from real-time transcription
Learn more
In this tutorial, you built a Go application that transcribes incoming phone calls in real time, using Twilio and AssemblyAI.
To learn more about Real-Time Transcription, check out the following resources:
To keep up with more content like this, make sure you subscribe to our newsletter and join our Discord server.