As technology advances, scam attempts are becoming more sophisticated and harder to identify. Speech AI uses recent advances in speech recognition and generative AI to understand spoken language, and can give us the upper hand against scammers.
In this tutorial, you’ll build an application that uses Speech AI to transcribe a phone call and identify scam attempts.
By the end of this tutorial, you’ll be able to:
- Transcribe a phone call in real-time using AssemblyAI and Twilio.
- Summarize and assess a phone call transcript using LeMUR.
Before you get started
To complete this tutorial, you'll need:
- An upgraded AssemblyAI account
- Go installed
Step 1: Set up ngrok
Twilio will need to access your server through a publicly available URL. In this tutorial, you'll use ngrok to create a publicly available URL for an application running on your local computer.
- Sign up for an ngrok account.
- Install ngrok for your platform.
- Authenticate your ngrok agent using Your Authtoken.
ngrok config add-authtoken <YOUR_TOKEN>
- Open an ngrok tunnel for port 8080. ngrok will only tunnel connections while the following command is running.
ngrok http 8080
In the terminal output, the URL next to Forwarding is the publicly available URL that forwards to your local port 8080 (for example, https://84c5df474.ngrok-free.dev).
Copy the Forwarding URL in your terminal output and save it for the next step.
Step 2: Set up Twilio
You'll need to register a phone number with Twilio and configure it to call your server application whenever someone calls that number. You can also use the Twilio console to update the voice URL for your phone number.
- Sign up for a Twilio account.
- Download Twilio CLI.
- In a new terminal, log in using Twilio CLI. You'll be asked to enter an identifier for your new profile, for example dev.
twilio login
- Select the profile you created.
twilio profiles:use <YOUR_PROFILE_ID>
- Update the voice URL for your phone number.
twilio phone-numbers:update <YOUR_TWILIO_NUMBER> --voice-url <YOUR_NGROK_URL>
Now, when someone calls your phone number, Twilio sends a request to your ngrok URL, which forwards it to port 8080 on your local computer. Because you don't have to deploy every change to a cloud instance, development goes faster.
Next up, you'll build the server application to handle the phone call.
Step 3: Transcribe phone calls in real-time
Before we can identify whether a caller is a scammer, we first need to identify what they’re saying. In this step, you’ll build a Go server application that transcribes Twilio phone calls in real-time.
This step uses the application we built in Transcribe phone calls in real-time in Go with Twilio and AssemblyAI. If you want to understand the following code in more detail, refer to the original post.
- Create and navigate into a new project directory.
mkdir scam-screener-go
cd scam-screener-go
- Initialize your Go module.
go mod init scam-screener-go
- Install the AssemblyAI Go SDK.
go get github.com/AssemblyAI/assemblyai-go-sdk
- Install the nhooyr.io/websocket module. You'll need this to handle the incoming Twilio MediaStream connection.
go get nhooyr.io/websocket
- Create a new file called main.go with the following code. The starter code uses the finished example from Transcribe phone calls in real-time in Go with Twilio and AssemblyAI to transcribe a Twilio phone call.
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"os"

	"nhooyr.io/websocket"
	"nhooyr.io/websocket/wsjson"

	aai "github.com/AssemblyAI/assemblyai-go-sdk"
)

type realtimeTranscriber struct {
}

func (t *realtimeTranscriber) SessionBegins(ev aai.SessionBegins) {
	log.Println("session begins")
}

func (t *realtimeTranscriber) SessionTerminated(ev aai.SessionTerminated) {
	log.Println("session terminated")
}

func (t *realtimeTranscriber) FinalTranscript(transcript aai.FinalTranscript) {
	fmt.Printf("%s\r\n", transcript.Text)
}

func (t *realtimeTranscriber) PartialTranscript(transcript aai.PartialTranscript) {
	// Ignore silence.
	if transcript.Text == "" {
		return
	}
	fmt.Printf("%s\r", transcript.Text)
}

func (t *realtimeTranscriber) Error(err error) {
	log.Println("something bad happened:", err)
}

var apiKey = os.Getenv("ASSEMBLYAI_API_KEY")

func main() {
	http.HandleFunc("/", twilio)
	http.HandleFunc("/media", media)

	log.Println("Server is running on port 8080")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatal(err)
	}
}

func twilio(w http.ResponseWriter, r *http.Request) {
	if r.Method != "POST" {
		w.WriteHeader(http.StatusMethodNotAllowed)
		return
	}

	twiML := `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say>
    Speak to see your audio transcribed in the console.
  </Say>
  <Connect>
    <Stream url='%s' />
  </Connect>
</Response>`

	w.Header().Add("Content-Type", "application/xml")
	fmt.Fprintln(w, fmt.Sprintf(twiML, "wss://"+r.Host+"/media"))
}

type TwilioMessage struct {
	Event string `json:"event"`
	Media struct {
		// Contains audio samples.
		Payload []byte `json:"payload"`
	} `json:"media"`
}

func media(w http.ResponseWriter, r *http.Request) {
	// Upgrade HTTP request to WebSocket.
	c, err := websocket.Accept(w, r, nil)
	if err != nil {
		log.Println("unable to upgrade connection to websocket:", err)
		w.WriteHeader(http.StatusInternalServerError)
		return
	}
	defer c.CloseNow()

	ctx, cancel := context.WithCancel(r.Context())
	defer cancel()

	handler := new(realtimeTranscriber)

	client := aai.NewRealTimeClientWithOptions(
		aai.WithRealTimeAPIKey(apiKey),
		// Twilio MediaStream sends audio in mu-law format.
		aai.WithRealTimeEncoding(aai.RealTimeEncodingPCMMulaw),
		// Twilio MediaStream sends audio at 8000 samples per second.
		aai.WithRealTimeSampleRate(8000),
		aai.WithHandler(handler),
	)

	if err := client.Connect(ctx); err != nil {
		log.Println("unable to connect to real-time transcription:", err)
		c.Close(websocket.StatusInternalError, err.Error())
		return
	}
	log.Println("connected to real-time transcription")

	for {
		var message TwilioMessage
		err = wsjson.Read(ctx, c, &message)
		if err != nil {
			log.Println("unable to read twilio message:", err)
			c.Close(websocket.StatusInternalError, err.Error())
			return
		}

		switch message.Event {
		case "connected":
			log.Println("twilio mediastream connected")
		case "start":
			log.Println("twilio mediastream started")
		case "media":
			if err := client.Send(ctx, message.Media.Payload); err != nil {
				log.Println("unable to send audio for real-time transcription:", err)
				c.Close(websocket.StatusInternalError, err.Error())
				return
			}
		case "stop":
			log.Println("twilio mediastream stopped")
			if err := client.Disconnect(ctx, true); err != nil {
				log.Println("unable to disconnect from real-time transcription:", err)
				c.Close(websocket.StatusInternalError, err.Error())
				return
			}
			log.Println("disconnected from real-time transcription")
			c.Close(websocket.StatusNormalClosure, "")
			return
		}
	}
}
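One detail in the code above is easy to miss: the Payload field is declared as []byte, and Go's encoding/json decodes a JSON string into a []byte by treating it as base64. Twilio sends the audio as a base64 string, so the media handler receives raw audio bytes without an explicit decode step. The sketch below illustrates this with a hand-built message. The lowercase type name, the decodeMessage helper, and the four sample bytes are made up for the example.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// twilioMessage mirrors the TwilioMessage type from the tutorial code.
type twilioMessage struct {
	Event string `json:"event"`
	Media struct {
		// encoding/json base64-decodes JSON strings into []byte fields.
		Payload []byte `json:"payload"`
	} `json:"media"`
}

// decodeMessage parses one JSON text frame from the media stream.
func decodeMessage(raw []byte) (twilioMessage, error) {
	var m twilioMessage
	err := json.Unmarshal(raw, &m)
	return m, err
}

func main() {
	// Four made-up bytes standing in for mu-law audio samples.
	audio := []byte{0xff, 0x7f, 0x00, 0x80}
	raw := fmt.Sprintf(`{"event":"media","media":{"payload":%q}}`,
		base64.StdEncoding.EncodeToString(audio))

	m, err := decodeMessage([]byte(raw))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(m.Event, len(m.Media.Payload))
}
```

Messages without a media object, such as the start and stop events, decode with an empty payload.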
Next, you'll inspect the full transcript from the phone call to determine whether it's a scam call or not.
Step 4: Store the full transcript
Real-Time Transcription returns a final transcript once it detects the end of an utterance. Utterances are continuous pieces of speech that are separated by silence. Since a phone call may contain multiple utterances, you need to store all final transcripts for later.
- Add a field to the realtimeTranscriber that will store all final transcripts during the phone call.
type realtimeTranscriber struct {
	finalTranscripts []string
}
- In the FinalTranscript method of the realtimeTranscriber, append each transcript to the field:
func (t *realtimeTranscriber) FinalTranscript(transcript aai.FinalTranscript) {
	fmt.Printf("%s\r\n", transcript.Text)
	t.finalTranscripts = append(t.finalTranscripts, transcript.Text)
}
Once the phone call ends, finalTranscripts will contain the full transcript for the phone call.
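The steps above append to a plain slice, which is fine when the slice is only read after the session has been disconnected. If you later read the transcript while the call is still in progress, and the callbacks run on a different goroutine than the reader, the slice would need protection. The following is a minimal sketch of one way to do that with a mutex. The transcriptLog type and its methods are hypothetical and not part of the AssemblyAI SDK.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// transcriptLog is a hypothetical helper that collects final transcripts
// and can be used from several goroutines at once.
type transcriptLog struct {
	mu     sync.Mutex
	finals []string
}

// Add appends one final transcript.
func (l *transcriptLog) Add(text string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.finals = append(l.finals, text)
}

// Full returns all transcripts collected so far, one per line.
func (l *transcriptLog) Full() string {
	l.mu.Lock()
	defer l.mu.Unlock()
	return strings.Join(l.finals, "\n")
}

func main() {
	var l transcriptLog
	var wg sync.WaitGroup

	// Simulate callbacks arriving on separate goroutines.
	for _, u := range []string{"Hello, this is your bank.", "Please confirm your PIN."} {
		wg.Add(1)
		go func(text string) {
			defer wg.Done()
			l.Add(text)
		}(u)
	}
	wg.Wait()

	fmt.Println(l.Full())
}
```

In the tutorial's handler, FinalTranscript would call Add, and the "stop" case would call Full instead of joining the slice directly.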
Step 5: Summarize the full transcript using LeMUR
LeMUR is a framework by AssemblyAI that makes it easy to use the power of LLMs with voice data. In this step, you’ll use LeMUR to create a summary of the phone call along with an assessment of whether it’s a scam call or not.
- Create a new const called prompt and write the prompt.
const prompt = `Provide a summary of the call, followed by a brief assessment of whether it's a scam call.

<context>
You're a personal assistant who's helping an elderly person protect themselves from scammers.
</context>

<answer_format>
One-sentence summary in second person. Do not provide a preamble.
</answer_format>`
- Create a new function called summarize with the following content. The function uses LeMUR to create a custom summary using an LLM.
func summarize(ctx context.Context, transcript string) (string, error) {
	c := aai.NewClient(apiKey)

	// LeMUR generates a summary and assessment of the transcript using the
	// provided prompt.
	resp, err := c.LeMUR.Task(ctx, aai.LeMURTaskParams{
		LeMURBaseParams: aai.LeMURBaseParams{
			InputText: aai.String(transcript),
		},
		Prompt: aai.String(prompt),
	})
	if err != nil {
		return "", err
	}

	return *resp.Response, nil
}
- Summarize the transcript after the call has ended. Update the "stop" case as shown, and add "strings" to your imports, since the new code calls strings.Join.
		case "stop":
			log.Println("twilio mediastream stopped")
			if err := client.Disconnect(ctx, true); err != nil {
				log.Println("unable to disconnect from real-time transcription:", err)
				c.Close(websocket.StatusInternalError, err.Error())
				return
			}
			log.Println("disconnected from real-time transcription")
			c.Close(websocket.StatusNormalClosure, "")

			summary, err := summarize(ctx, strings.Join(handler.finalTranscripts, "\n"))
			if err != nil {
				log.Println("unable to summarize call:", err)
				return
			}
			log.Println("Summary:", summary)
			return
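- Start the server to accept calls. The code reads your AssemblyAI API key from the ASSEMBLYAI_API_KEY environment variable, so make sure it's set before you start.
go run main.go
Call your Twilio phone number, leave a voicemail, and hang up. In your terminal, you'll see the transcript of your message, followed by a log line that starts with Summary: and contains the summary and scam assessment from LeMUR.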
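If the caller hangs up without saying anything, there are no final transcripts and nothing useful to summarize. A small guard before the call to summarize can skip the request in that case. The sketch below is one possible helper. joinTranscripts is a hypothetical name, and trimming whitespace-only entries is an assumption about what counts as empty.

```go
package main

import (
	"fmt"
	"strings"
)

// joinTranscripts is a hypothetical helper. It drops blank entries, joins
// the rest with newlines, and reports whether any text is left.
func joinTranscripts(finals []string) (string, bool) {
	kept := make([]string, 0, len(finals))
	for _, f := range finals {
		if s := strings.TrimSpace(f); s != "" {
			kept = append(kept, s)
		}
	}
	if len(kept) == 0 {
		return "", false
	}
	return strings.Join(kept, "\n"), true
}

func main() {
	finals := []string{"Hi, it's your grandson.", " ", "I need money today."}
	if text, ok := joinTranscripts(finals); ok {
		fmt.Println(text)
	}

	if _, ok := joinTranscripts(nil); !ok {
		fmt.Println("nothing to summarize")
	}
}
```

In the "stop" case, you could call summarize only when the second return value is true.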
Learn more
In this tutorial, you’ve learned how to transcribe a phone call and use LLMs to identify whether it’s a scam attempt. Try to see if you can avoid detection. If you can, think about ways to improve the detection accuracy. Check out Improving your prompt for more inspiration.
You can also experiment with using different LLMs by setting FinalModel on the LeMURBaseParams. For example, by using the faster and cheaper `basic` model, you could prompt LeMUR incrementally after each utterance, to provide a live assessment during the call. For more information on supported LLMs, refer to Change the model type in our docs.
To learn more, check out the following resources:
- Transcribe streaming audio from a microphone in Go
- Processing audio with LLMs using LeMUR
- Reference documentation for AssemblyAI Go SDK
To keep up with more content like this, make sure you subscribe to our newsletter and join our Discord server.
Complete code example
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"os"
	"strings"

	"nhooyr.io/websocket"
	"nhooyr.io/websocket/wsjson"

	aai "github.com/AssemblyAI/assemblyai-go-sdk"
)

type realtimeTranscriber struct {
	finalTranscripts []string
}

func (t *realtimeTranscriber) SessionBegins(ev aai.SessionBegins) {
	log.Println("session begins")
}

func (t *realtimeTranscriber) SessionTerminated(ev aai.SessionTerminated) {
	log.Println("session terminated")
}

func (t *realtimeTranscriber) FinalTranscript(transcript aai.FinalTranscript) {
	fmt.Printf("%s\r\n", transcript.Text)
	t.finalTranscripts = append(t.finalTranscripts, transcript.Text)
}

func (t *realtimeTranscriber) PartialTranscript(transcript aai.PartialTranscript) {
	// Ignore silence.
	if transcript.Text == "" {
		return
	}
	fmt.Printf("%s\r", transcript.Text)
}

func (t *realtimeTranscriber) Error(err error) {
	log.Println("something bad happened:", err)
}

var apiKey = os.Getenv("ASSEMBLYAI_API_KEY")

func main() {
	http.HandleFunc("/", twilio)
	http.HandleFunc("/media", media)

	log.Println("Server is running on port 8080")
	if err := http.ListenAndServe(":8080", nil); err != nil {
		log.Fatal(err)
	}
}

func twilio(w http.ResponseWriter, r *http.Request) {
	if r.Method != "POST" {
		w.WriteHeader(http.StatusMethodNotAllowed)
		return
	}

	twiML := `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say>
    Speak to see your audio transcribed in the console.
  </Say>
  <Connect>
    <Stream url='%s' />
  </Connect>
</Response>`

	w.Header().Add("Content-Type", "application/xml")
	fmt.Fprintln(w, fmt.Sprintf(twiML, "wss://"+r.Host+"/media"))
}

type TwilioMessage struct {
	Event string `json:"event"`
	Media struct {
		// Contains audio samples.
		Payload []byte `json:"payload"`
	} `json:"media"`
}

func media(w http.ResponseWriter, r *http.Request) {
	// Upgrade HTTP request to WebSocket.
	c, err := websocket.Accept(w, r, nil)
	if err != nil {
		log.Println("unable to upgrade connection to websocket:", err)
		w.WriteHeader(http.StatusInternalServerError)
		return
	}
	defer c.CloseNow()

	ctx, cancel := context.WithCancel(r.Context())
	defer cancel()

	handler := new(realtimeTranscriber)

	client := aai.NewRealTimeClientWithOptions(
		aai.WithRealTimeAPIKey(apiKey),
		// Twilio MediaStream sends audio in mu-law format.
		aai.WithRealTimeEncoding(aai.RealTimeEncodingPCMMulaw),
		// Twilio MediaStream sends audio at 8000 samples per second.
		aai.WithRealTimeSampleRate(8000),
		aai.WithHandler(handler),
	)

	if err := client.Connect(ctx); err != nil {
		log.Println("unable to connect to real-time transcription:", err)
		c.Close(websocket.StatusInternalError, err.Error())
		return
	}
	log.Println("connected to real-time transcription")

	for {
		var message TwilioMessage
		err = wsjson.Read(ctx, c, &message)
		if err != nil {
			log.Println("unable to read twilio message:", err)
			c.Close(websocket.StatusInternalError, err.Error())
			return
		}

		switch message.Event {
		case "connected":
			log.Println("twilio mediastream connected")
		case "start":
			log.Println("twilio mediastream started")
		case "media":
			if err := client.Send(ctx, message.Media.Payload); err != nil {
				log.Println("unable to send audio for real-time transcription:", err)
				c.Close(websocket.StatusInternalError, err.Error())
				return
			}
		case "stop":
			log.Println("twilio mediastream stopped")
			if err := client.Disconnect(ctx, true); err != nil {
				log.Println("unable to disconnect from real-time transcription:", err)
				c.Close(websocket.StatusInternalError, err.Error())
				return
			}
			log.Println("disconnected from real-time transcription")
			c.Close(websocket.StatusNormalClosure, "")

			summary, err := summarize(ctx, strings.Join(handler.finalTranscripts, "\n"))
			if err != nil {
				log.Println("unable to summarize call:", err)
				return
			}
			log.Println("Summary:", summary)
			return
		}
	}
}

const prompt = `Provide a summary of the call, followed by a brief assessment of whether it's a scam call.

<context>
You're a personal assistant who's helping an elderly person protect themselves from scammers.
</context>

<answer_format>
One-sentence summary in second person. Do not provide a preamble.
</answer_format>`

func summarize(ctx context.Context, transcript string) (string, error) {
	c := aai.NewClient(apiKey)

	// LeMUR generates a summary and assessment of the transcript using the
	// provided prompt.
	resp, err := c.LeMUR.Task(ctx, aai.LeMURTaskParams{
		LeMURBaseParams: aai.LeMURBaseParams{
			InputText: aai.String(transcript),
		},
		Prompt: aai.String(prompt),
	})
	if err != nil {
		return "", err
	}

	return *resp.Response, nil
}