In this tutorial, we’ll learn how to build a Flask application that transcribes phone calls in real-time. Here’s a look at the end result:
Overview
Let’s start with a high-level overview of how our service will work. First, a user will call a Twilio number which we will configure to forward the incoming data stream to AssemblyAI. Then, AssemblyAI’s real-time transcription service will transcribe this audio stream as it comes in, and send us partial transcripts in quick succession, which we’ll print to the terminal as they come in.
Now let’s take a look at how this will work more technically.
- First, a user calls the phone number that we provision with Twilio.
- Twilio then calls the specific endpoint associated with this number.
- In our case, we will configure the endpoint to be an ngrok URL, which provides a tunnel to a port on our local machine from a publicly accessible URL. Ngrok therefore allows us to expose our application to Twilio without having to provision a cloud machine or modify firewall rules.
- Through the ngrok tunnel, Twilio calls an endpoint in a Flask application, which responds with TwiML (Twilio Markup Language) that instructs Twilio on how to handle the call.
- In our case, the TwiML will tell Twilio to pass the incoming audio stream from the phone call to a WebSocket in our Flask application.
- This WebSocket will receive the audio stream and send it to AssemblyAI for transcription, printing the corresponding transcript to the terminal as it is received in real time.
You can find all of the code for this tutorial in this GitHub repository.
Getting Started
To get started, you’ll need:
- An AssemblyAI account with funds added
- A Twilio account (free account should be sufficient)
- An ngrok account and ngrok installed on your system
- Python installed on your system
Now create and navigate into a project directory:
mkdir realtime-phone-transcription
cd realtime-phone-transcription
Step 1: Set up credentials and environment
We’ll be using python-dotenv to manage our credentials. Create a file called .env in your project directory, and add the below text:
NGROK_AUTHTOKEN=replace-this
TWILIO_ACCOUNT_SID=replace-this
TWILIO_API_KEY_SID=replace-this
TWILIO_API_SECRET=replace-this
ASSEMBLYAI_API_KEY=replace-this
Make sure to replace each replace-this placeholder with the corresponding credential. You can find your ngrok authtoken in the Getting Started > Your Authtoken tab on your ngrok dashboard, or by checking the file returned by running ngrok config check if you already have ngrok set up on your system. If you have not configured ngrok on your system, add your ngrok authtoken to your CLI by running the command:
ngrok config add-authtoken YOUR-TOKEN-HERE
You can find your Twilio account SID (TWILIO_ACCOUNT_SID) in your Twilio console under Account > API keys & tokens. Here you can also create an API key for TWILIO_API_KEY_SID and TWILIO_API_SECRET. A Standard key type is sufficient to follow along with this tutorial.
You can find your AssemblyAI API key on your AssemblyAI dashboard.
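A forgotten replace-this placeholder tends to surface only later, as an authentication error, so it can help to check the credentials up front. The following is a minimal sketch rather than part of the original project: the helper name `missing_credentials` is hypothetical, and it assumes the variable names and placeholder text from the .env file above.

```python
import os

REQUIRED_KEYS = [
    "NGROK_AUTHTOKEN",
    "TWILIO_ACCOUNT_SID",
    "TWILIO_API_KEY_SID",
    "TWILIO_API_SECRET",
    "ASSEMBLYAI_API_KEY",
]
PLACEHOLDER = "replace-this"


def missing_credentials(env=None):
    """Return the required keys that are unset, empty, or still the placeholder."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if env.get(key, "") in ("", PLACEHOLDER)]


# Example with an explicit mapping instead of the real environment
example_env = {key: "some-value" for key in REQUIRED_KEYS}
example_env["TWILIO_API_SECRET"] = PLACEHOLDER
print(missing_credentials(example_env))
```

Called with no arguments, the helper inspects the real environment, so it would be run after the .env file has been loaded (for example with python-dotenv's `load_dotenv`).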
Finally, create a file called .gitignore and copy the below text into it:
.env
venv
__pycache__
This will prevent you from accidentally tracking your .env file with git and potentially uploading it to a website like GitHub. Additionally, it will prevent you from tracking/uploading your virtual environment and cache files.
Now create and activate a virtual environment for the project:
# Mac/Linux
python3 -m venv venv
. venv/bin/activate
# Windows
python -m venv venv
.\venv\Scripts\activate.bat
Next, we’ll install all of the dependencies we will need for the project. Execute the below command:
pip install Flask flask-sock assemblyai python-dotenv ngrok twilio
Step 2: Create the Flask application
Create a file in the project directory called main.py and add the following imports:
from flask import Flask
from flask_sock import Sock
These lines import Flask and Sock so that we can create a web application with WebSockets. Next, add these lines that define some settings for our application:
PORT = 5000
DEBUG = False
INCOMING_CALL_ROUTE = '/'
WEBSOCKET_ROUTE = '/realtime'
In particular, we set the port that the app should run on, set debugging to false, and then define the route for the HTTP endpoint that will be hit when our Twilio number is called, and the route for the WebSocket to which the audio data will be sent.
Now add the below lines to main.py, which instantiate our app and define the functions for these endpoints. Additionally, we run the app on the specified port and debugging mode when python main.py is executed.
app = Flask(__name__)
sock = Sock(app)

@app.route(INCOMING_CALL_ROUTE)
def receive_call():
    pass

@sock.route(WEBSOCKET_ROUTE)
def transcription_websocket(ws):
    pass

if __name__ == "__main__":
    app.run(port=PORT, debug=DEBUG)
Step 3: Define the root endpoint
Now that the basic structure of our Flask app is defined, we can start to define our endpoints. We’ll start by defining the root endpoint that Twilio will hit when our Twilio phone number is called.
Modify your receive_call function as follows:
@app.route(INCOMING_CALL_ROUTE)
def receive_call():
    return "Real-time phone call transcription app"
Now run python main.py from a terminal in the project directory and go to http://localhost:5000 in your browser. You will see the return message displayed.
By default, the only HTTP request method available for Flask routes is GET, and the endpoint will respond with the value returned by the corresponding function. In our case, Twilio will send a POST request to the endpoint that we associate with our Twilio number, so we need to modify this Python function accordingly.
Modify your imports and receive_call function as follows:
from flask import Flask, request, Response

# ...

@app.route(INCOMING_CALL_ROUTE, methods=['GET', 'POST'])
def receive_call():
    if request.method == 'POST':
        # TwiML that tells Twilio to speak a sentence to the caller
        xml = """
<Response>
    <Say>
        You have connected to the Flask application
    </Say>
</Response>
""".strip()
        return Response(xml, mimetype='text/xml')
    else:
        return "Real-time phone call transcription app"
First, we update the app.route decorator to allow both GET and POST requests. Then, inside the receive_call function, we access the HTTP request information using request imported from flask to check what the request type/method is.
If it is a POST request, then we return a block of TwiML, which is Twilio’s version of XML that instructs Twilio on what to do when this endpoint is called. In our case, we use <Say> tags that tell Twilio to speak the sentence between the tags to the caller. We then return an HTTP Response which contains the TwiML, and we set the MIME type to XML.
Finally, if the HTTP request method is not POST, then it is a GET, so in the else block we return the same text as before.
We now have a functional Flask application that will respond with TwiML if called by Twilio. The next step is to get a Twilio number and point it to this application.
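Writing TwiML by hand works well for fixed text, but a message containing characters such as & or < would produce invalid XML. As a hedged alternative, the sketch below builds the same <Response>/<Say> structure with Python's standard xml.etree.ElementTree module, which escapes text on serialization. The function name `say_twiml` is hypothetical and not part of the tutorial's code.

```python
import xml.etree.ElementTree as ET


def say_twiml(message):
    """Build a <Response><Say>...</Say></Response> document as a string."""
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say")
    say.text = message  # special characters are escaped when serialized
    return ET.tostring(response, encoding="unicode")


print(say_twiml("Transcription & analysis demo"))
```

The returned string could be passed to Response(xml, mimetype='text/xml') in the same way as the hand-written version. The twilio package installed earlier also provides TwiML helper classes that serve a similar purpose.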
Step 4: Get a Twilio number and open an ngrok tunnel
To get a Twilio number, go to your Twilio console and go to Phone Numbers > Manage > Buy a number. There, you will see a list of numbers you can purchase for a small monthly fee - select one and click Buy. Note that we only need Voice capabilities for this tutorial.
Next, we’ll open an ngrok tunnel on port 5000 (through which our Flask app will be served). In the terminal, execute the following command:
ngrok http http://localhost:5000/
In the terminal, you will see some information displayed about the tunnel. What we need is the public forwarding URL that ends in .ngrok-free.app, so copy this value now.
Back in your Twilio console, go to Phone Numbers > Manage > Active numbers and select the phone number you bought above. In the Voice Configuration, set a Webhook for when a call comes in, pasting the ngrok URL you just copied under URL and setting the HTTP method to HTTP POST. Then scroll down and click Save Configuration to save this change.
You have now configured your Twilio number to send a POST request to the ngrok URL when your number is called, and opened a tunnel that forwards this ngrok URL to port 5000 on your local machine. With all of this in place, we can now test our application.
Open another terminal in your project directory and run python main.py. You can go to http://localhost:5000 again to confirm that the application is up and running - you will see a 200 response in the terminal if you do so. Now call your Twilio phone number - you will hear a voice say "You have connected to the Flask application", and then the call will terminate.
We now have a working Flask application that can successfully receive and respond to a Twilio phone call - it’s time to add in the WebSocket that receives incoming speech.
Step 5: Set up a WebSocket to receive speech
Modify your receive_call function with the TwiML below:
@app.route(INCOMING_CALL_ROUTE, methods=['GET', 'POST'])
def receive_call():
    if request.method == 'POST':
        xml = f"""
<Response>
    <Say>
        Speak to see your audio data printed to the console
    </Say>
    <Connect>
        <Stream url='wss://{request.host}{WEBSOCKET_ROUTE}' />
    </Connect>
</Response>
""".strip()
        return Response(xml, mimetype='text/xml')
    else:
        return "Real-time phone call transcription app"
We’ve added <Connect> and <Stream> tags that tell Twilio to forward the incoming audio data to the specified WebSocket. In our case, we point it to a WebSocket in the same Flask app that we will define next.
Our WebSocket will be defined in the transcription_websocket function that we defined at the beginning of this tutorial. Import the json package, and then modify the transcription_websocket function as follows:
import json

# ...

@sock.route(WEBSOCKET_ROUTE)
def transcription_websocket(ws):
    while True:
        data = json.loads(ws.receive())
        match data['event']:
            case "connected":
                print('twilio connected')
            case "start":
                print('twilio started')
            case "media":
                payload = data['media']['payload']
                print(payload)
            case "stop":
                print('twilio stopped')
Our WebSocket will receive four possible types of messages from Twilio:
- connected when the WebSocket connection is established
- start when the data stream begins sending data
- media which contain the raw audio data, and
- stop when the stream is stopped or the call has ended
We receive each message with ws.receive(), and then load it to a dictionary with json.loads. We then handle each message according to its type stored in the event key. For now, we print the base64-encoded audio payload for media messages and a simple message for each remaining case.
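To make the message flow concrete without placing a call, the sketch below replays four hand-written messages through a small dispatcher. The sample messages are simplified stand-ins that contain only the keys used in this tutorial (real Twilio messages carry additional fields), and the function name `summarize` is hypothetical.

```python
import base64
import json

# Simplified stand-ins for Twilio's messages; real messages carry more fields
sample_messages = [
    json.dumps({"event": "connected"}),
    json.dumps({"event": "start"}),
    json.dumps({
        "event": "media",
        "media": {"payload": base64.b64encode(b"\xff" * 4).decode("ascii")},
    }),
    json.dumps({"event": "stop"}),
]


def summarize(raw_message):
    """Return a short description of one raw JSON message."""
    data = json.loads(raw_message)
    event = data["event"]
    if event == "media":
        audio = base64.b64decode(data["media"]["payload"])
        return f"media ({len(audio)} bytes of audio)"
    return event


summaries = [summarize(message) for message in sample_messages]
print(summaries)
```

This version uses if statements rather than match, since match statements require Python 3.10 or later.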
Start your Flask application by running python main.py from the terminal in the project directory, and then call your Twilio number. You will start seeing a stream of base64-encoded audio data printed in your console:
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////w==
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////w==
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////w==
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////fn38fXx9e/19//79/n39ff58/nz+ff59e/5+ff1+e/19/n19fn79fA==
fPt8/P99fH7+fnx9/P3+/f39/Pz//f58/nx9//5+fv59/f5+/n1+fv99ev38ff3+/X18eX57fH5+e35+fn59/Xz9+n1+/n78/3x8+v59ffv8/X7+fn19/v55fv9+/f78fX36fH37/n79ff/+fv79/X16fvr9/v79fX17fXz+fnt+/n57fn59fXx+fXz8fv/+fv58ff5+ev1+fn39/H59fg==
/Xx9/P59/f59fv97fX1+ff5+fP57fX5+fn58ff99ff17ff7+/n7+fnv8/3x8fv/8//7+//5+/f3//H36ffp4fP58/X1++/7//P16/X38fn59/X59+3x+/Xr9+339fn79fH3+/n19fv58fX1+/P5+fnv8e//9//37ff7+/35+fvt9ff3+/f56/v37/X57ef9+fv/8en3//X78fv39/vx9fQ==
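The long runs of / characters in this output are silence: in base64, //// decodes to three 0xFF bytes, and 0xFF is the mu-law code for zero amplitude. The sketch below decodes a payload and reports its duration and how much of it is silent. It assumes 8000 Hz mu-law audio at one byte per sample (the settings used later in this tutorial); the 160-byte chunk is an illustrative value rather than data captured from a call, and `describe_chunk` is a hypothetical helper name.

```python
import base64

SAMPLE_RATE = 8000    # samples per second; mu-law uses one byte per sample
MULAW_SILENCE = 0xFF  # mu-law byte that decodes to zero amplitude


def describe_chunk(payload_b64):
    """Return (duration in milliseconds, fraction of bytes that are silence)."""
    audio = base64.b64decode(payload_b64)
    duration_ms = len(audio) / SAMPLE_RATE * 1000
    silent_fraction = audio.count(MULAW_SILENCE) / len(audio)
    return duration_ms, silent_fraction


# An illustrative 160-byte chunk of pure silence (not captured from a real call)
silent_payload = base64.b64encode(bytes([MULAW_SILENCE]) * 160).decode("ascii")
print(silent_payload[:8], describe_chunk(silent_payload))
```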
Step 6: Define a real-time transcriber
We now have our Flask application running, receiving calls to a Twilio number via an ngrok tunnel, and printing the speech data to the console. It’s time to add real-time transcription.
Create a new file in your project directory called twilio_transcriber.py. We will define the object that we use to perform the real-time transcription in this module. At the top of the file, add the following code for imports and to set the AssemblyAI API key and define the Twilio audio sample rate:
import os
import assemblyai as aai
from dotenv import load_dotenv
load_dotenv()
aai.settings.api_key = os.getenv('ASSEMBLYAI_API_KEY')
TWILIO_SAMPLE_RATE = 8000 # Hz
Now we will add handlers for the four types of messages we will receive from AssemblyAI. Add the below functions to twilio_transcriber.py:
def on_open(session_opened: aai.RealtimeSessionOpened):
    "Called when the connection has been established."
    print("Session ID:", session_opened.session_id)

def on_data(transcript: aai.RealtimeTranscript):
    "Called when a new transcript has been received."
    if not transcript.text:
        return

    if isinstance(transcript, aai.RealtimeFinalTranscript):
        # Final transcript: finish the line
        print(transcript.text, end="\r\n")
    else:
        # Partial transcript: stay on the same line so it can be overwritten
        print(transcript.text, end="\r")

def on_error(error: aai.RealtimeError):
    "Called when an error occurs."
    print("An error occurred:", error)

def on_close():
    "Called when the connection has been closed."
    print("Closing Session")
on_open is called when a connection has been established, and on_close is called when it has been terminated. on_error is called when there is an error. For each of these message types, we print a single line with related information. The on_data function is called when our Flask application receives data from AssemblyAI’s real-time transcription service. In this case, we do one of three things.
If there is no transcript (i.e. there was no speech in the audio data sent to AssemblyAI), then we simply return the default value None. If we receive a transcript, then we do one of two things based on the transcript type.
Every message that AssemblyAI’s server sends is one of two types - either a partial transcript or a final transcript. Partial transcripts are sent in real-time when someone is speaking, gradually building up the transcript of what is being uttered. Each time a partial transcript is sent, the entire partial transcript for that utterance is sent, and not just the new words that have been spoken since the last partial transcript was sent.
When the real-time model detects that an utterance is complete, the entire utterance is sent one final time (formatted and punctuated by default) as a final transcript rather than partial. Once this final transcript is sent, we start this process over with a blank slate for the next utterance.
We can see how this process works in the below diagram:
So to handle these incoming messages, we print each partial transcript followed by a carriage return, and each final transcript followed by a carriage return and a newline. The carriage return brings the cursor back to the beginning of the line so that each partial transcript is printed over the previous one, giving the visual effect that the newly-transcribed words are being printed over time. The newline after a final transcript moves the cursor to a fresh line for the next utterance.
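The effect of the two line endings can be seen without a live connection. The sketch below is only an illustration: it uses plain (text, is_final) tuples as stand-ins for the SDK's transcript objects, and captures what would be written to the terminal so the raw characters can be inspected.

```python
import io
from contextlib import redirect_stdout

# (text, is_final) pairs standing in for the SDK's transcript objects
updates = [
    ("hello", False),
    ("hello world", False),
    ("Hello world.", True),
]


def print_update(text, is_final):
    """Print a partial with a carriage return, or a final with a full line ending."""
    print(text, end="\r\n" if is_final else "\r")


buffer = io.StringIO()
with redirect_stdout(buffer):
    for text, is_final in updates:
        print_update(text, is_final)

raw_output = buffer.getvalue()
print(repr(raw_output))
```

One caveat of this approach is that a carriage return does not clear the line, so if a later transcript is shorter than the one it overwrites, the tail of the older text remains visible.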
Now that we have our handlers defined, we need to define the class that we will actually use to perform the transcription. Add the below class to twilio_transcriber.py:
class TwilioTranscriber(aai.RealtimeTranscriber):
    def __init__(self):
        super().__init__(
            on_data=on_data,
            on_error=on_error,
            on_open=on_open, # optional
            on_close=on_close, # optional
            sample_rate=TWILIO_SAMPLE_RATE,
            encoding=aai.AudioEncoding.pcm_mulaw
        )
Our TwilioTranscriber is a subclass of the RealtimeTranscriber class in AssemblyAI’s Python SDK. We define the initializer of TwilioTranscriber by passing our handlers into the parent class’s __init__ method, as well as specifying a sample rate of 8000 Hz and the encoding as PCM mu-law, which are the settings Twilio streams use.
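The tutorial never needs to decode the audio itself, since the mu-law bytes are passed on unchanged, but seeing the decoding step may clarify what the encoding setting means. Mu-law stores each sample in a single byte using a logarithmic scale. The sketch below is a plain-Python rendering of the standard G.711 mu-law expansion, included purely for illustration; the function name is hypothetical.

```python
def mulaw_to_linear(byte):
    """Decode one G.711 mu-law byte to a signed linear PCM sample."""
    byte = ~byte & 0xFF            # mu-law bytes are stored bit-inverted
    sign = byte & 0x80             # top bit: sign
    exponent = (byte >> 4) & 0x07  # next three bits: segment
    mantissa = byte & 0x0F         # low four bits: step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# 0xFF is silence; 0x00 and 0x80 are the two extremes
print(mulaw_to_linear(0xFF), mulaw_to_linear(0x00), mulaw_to_linear(0x80))
```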
Step 7: Add real-time transcription to the WebSocket
Now that we have defined TwilioTranscriber, we need to use it in our main application code. In main.py, import base64 and TwilioTranscriber, and then modify the transcription_websocket function to match the below code:
import base64
from twilio_transcriber import TwilioTranscriber

# ...

@sock.route(WEBSOCKET_ROUTE)
def transcription_websocket(ws):
    print('called')
    while True:
        data = json.loads(ws.receive())
        match data['event']:
            case "connected":
                transcriber = TwilioTranscriber()
                transcriber.connect()
                print('transcriber connected')
            case "start":
                print('twilio started')
            case "media":
                payload_b64 = data['media']['payload']
                payload_mulaw = base64.b64decode(payload_b64)
                transcriber.stream(payload_mulaw)
            case "stop":
                print('twilio stopped')
                transcriber.close()
                print('transcriber closed')
We’ve updated our connected handler to instantiate a TwilioTranscriber and connect to AssemblyAI’s servers, updated the media handler to decode the base64-encoded audio data and then pass it to the transcriber’s stream method, and updated the stop handler to close the transcriber’s connection to AssemblyAI’s servers.
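The loop above keeps waiting for messages after a stop event and assumes a connected message always arrives before any media. A slightly more defensive variant is sketched below. To keep it runnable on its own, it uses two hypothetical stand-ins, `FakeWebSocket` and `FakeTranscriber`, in place of the flask-sock connection and TwilioTranscriber; only the receive, connect, stream, and close methods used in this tutorial are assumed.

```python
import base64
import json


class FakeTranscriber:
    """Hypothetical stand-in exposing the three methods the loop relies on."""
    def __init__(self):
        self.chunks, self.closed = [], False

    def connect(self):
        pass

    def stream(self, audio):
        self.chunks.append(audio)

    def close(self):
        self.closed = True


class FakeWebSocket:
    """Hypothetical stand-in that replays a fixed list of messages."""
    def __init__(self, messages):
        self._messages = iter(messages)

    def receive(self):
        return next(self._messages)


def handle_stream(ws, make_transcriber):
    transcriber = None
    try:
        while True:
            data = json.loads(ws.receive())
            event = data["event"]
            if event == "connected":
                transcriber = make_transcriber()
                transcriber.connect()
            elif event == "media" and transcriber is not None:
                transcriber.stream(base64.b64decode(data["media"]["payload"]))
            elif event == "stop":
                break  # leave the loop instead of waiting for more messages
    finally:
        if transcriber is not None:
            transcriber.close()  # runs even if the socket fails mid-call
    return transcriber


messages = [
    json.dumps({"event": "connected"}),
    json.dumps({"event": "media",
                "media": {"payload": base64.b64encode(b"\xff\xff").decode("ascii")}}),
    json.dumps({"event": "stop"}),
]
result = handle_stream(FakeWebSocket(messages), FakeTranscriber)
print(result.chunks, result.closed)
```

In the real application, the same structure would be used with the ws object that flask-sock passes in and with TwilioTranscriber as the factory.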
Finally, update the <Say> tags in the receive_call function to contain a fitting phrase now that our console will print the audio transcription rather than just the audio data:
<Say>
    Speak to see your speech transcribed in the console
</Say>
Run python main.py in a terminal from the project directory, and call your Twilio number. As you speak, you will see your speech transcribed in the console.
Step 8: Automatically set the Twilio webhook and ngrok tunnel
Our application is running and fully functional, but we can further improve it. Currently, every time we want to run the application, we must open an ngrok tunnel in a separate terminal and then copy the forwarding URL from this terminal into Twilio’s console in the browser.
This is fairly laborious, so it’s time to automate these steps. First, update your .env file to include your Twilio number as a TWILIO_NUMBER environment variable:
NGROK_AUTHTOKEN=replace-this
TWILIO_ACCOUNT_SID=replace-this
TWILIO_API_KEY_SID=replace-this
TWILIO_API_SECRET=replace-this
ASSEMBLYAI_API_KEY=replace-this
TWILIO_NUMBER=replace-this
The number should be written in E.164 format: a plus sign followed by the country code and the rest of the number, with no spaces or punctuation. For example, a United States number takes the form +12345678901.
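Twilio reports phone numbers in E.164 format, and the code later in this step compares TWILIO_NUMBER against those values as plain strings, so a number containing spaces or dashes would never match. A rough format check, sketched below with a hypothetical helper name, can catch this early. It only checks the general shape (a plus sign and up to 15 digits), not whether the number actually exists.

```python
import re

# A plus sign, a non-zero leading digit, then up to 14 more digits
E164_PATTERN = re.compile(r"\+[1-9]\d{1,14}")


def looks_like_e164(number):
    """Return True if the string has the general shape of an E.164 number."""
    return E164_PATTERN.fullmatch(number) is not None


for candidate in ["+12345678901", "12345678901", "+1 (234) 567-8901"]:
    print(candidate, looks_like_e164(candidate))
```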
Now, update the top of your main.py file as follows:
import base64
import json
import os
from flask import Flask, request, Response
from flask_sock import Sock
import ngrok
from twilio.rest import Client
from dotenv import load_dotenv
load_dotenv()
from twilio_transcriber import TwilioTranscriber
# ...
# Twilio authentication
account_sid = os.environ['TWILIO_ACCOUNT_SID']
api_key = os.environ['TWILIO_API_KEY_SID']
api_secret = os.environ['TWILIO_API_SECRET']
client = Client(api_key, api_secret, account_sid)
# Twilio phone number to call
TWILIO_NUMBER = os.environ['TWILIO_NUMBER']
# ngrok authentication
ngrok.set_auth_token(os.getenv("NGROK_AUTHTOKEN"))
We’ve added authentication variables to instantiate a Twilio Client, and loaded our Twilio phone number from its environment variable. Finally, we’ve set our ngrok auth token through ngrok.set_auth_token.
Next, update the script’s main block as follows:
if __name__ == "__main__":
    try:
        # Open Ngrok tunnel
        listener = ngrok.forward(f"http://localhost:{PORT}")
        print(f"Ngrok tunnel opened at {listener.url()} for port {PORT}")
        NGROK_URL = listener.url()

        # Set ngrok URL to be the webhook for the appropriate Twilio number
        twilio_numbers = client.incoming_phone_numbers.list()
        twilio_number_sid = [num.sid for num in twilio_numbers if num.phone_number == TWILIO_NUMBER][0]
        client.incoming_phone_numbers(twilio_number_sid).update(account_sid, voice_url=f"{NGROK_URL}{INCOMING_CALL_ROUTE}")

        # run the app
        app.run(port=PORT, debug=DEBUG)
    finally:
        # Always disconnect the ngrok tunnel
        ngrok.disconnect()
First, we open up an ngrok tunnel with ngrok.forward, and then use the twilio library to programmatically set our Twilio number’s voice webhook to the URL of the tunnel. It appears that it is not possible to call the incoming_phone_numbers method directly on our Twilio number, so we first have to isolate its SID with a list comprehension and then pass the SID into this method. Finally, we run our app as before with app.run(). All of this code is wrapped in a try…finally block that ensures our ngrok tunnel is always terminated properly.
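One rough edge in the lookup is that indexing the list comprehension with [0] raises a bare IndexError if TWILIO_NUMBER does not match any number on the account. The sketch below shows one way to produce a clearer error. The helper name `find_number_sid` is hypothetical, and the SimpleNamespace objects are stand-ins for the records returned by the Twilio client, assuming only the sid and phone_number attributes used above; the SID values are made up.

```python
from types import SimpleNamespace


def find_number_sid(numbers, phone_number):
    """Return the SID of the record whose phone_number matches, or raise."""
    sid = next((num.sid for num in numbers if num.phone_number == phone_number), None)
    if sid is None:
        raise ValueError(f"{phone_number} was not found on this Twilio account")
    return sid


# Stand-in records with made-up SIDs, in place of client.incoming_phone_numbers.list()
fake_numbers = [
    SimpleNamespace(sid="PN-example-1", phone_number="+12345678901"),
    SimpleNamespace(sid="PN-example-2", phone_number="+12345678902"),
]
print(find_number_sid(fake_numbers, "+12345678902"))
```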
If you have a free ngrok account you can only have one tunnel open at a time, so close your previous tunnel if it is still open, and then run python main.py in order to execute our program. Call your Twilio number and speak to see your speech transcribed in the console without having to manually open an ngrok tunnel and update the Twilio console.
Final words
In this tutorial, we learned how to use Twilio and AssemblyAI to transcribe phone calls in real-time. Check out our other tutorials or our documentation to learn more about how you can build AI-powered features to analyze speech.
Alternatively, check out other content on our Blog or YouTube channel to learn more about AI, or feel free to join us on Twitter or Discord to stay in the loop when we release new content.