If you've ever wanted to create a personal AI assistant like Siri or Alexa, the first step is to figure out how to trigger the AI using a specific word or phrase (also known as a hotword). All of the prevalent AI systems use a similar approach; for Alexa, the hotword is "Alexa," and for Siri, the hotword is "Hey Siri."
In this tutorial, you'll learn how to implement a hotword detection system using AssemblyAI's Streaming Speech-to-Text API. In homage to Iron Man, the assistant in the example is called Jarvis, but you're welcome to name it whatever you want. This tutorial uses the AssemblyAI Go SDK, but if Go isn't your preferred language, you're welcome to use any other supported programming language.
Before you start
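To complete this tutorial, you'll need Go installed, the PortAudio library available on your system, and an AssemblyAI account with billing set up.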
Set up your environment
You'll use the Go bindings of PortAudio to get raw audio data from your microphone and the AssemblyAI Go SDK to interface with AssemblyAI.
Let's start by creating a new Go project and setting up all of the required dependencies. To do so, navigate to your preferred directory and run the following commands:
mkdir jarvis
cd jarvis
go mod init jarvis
go get github.com/gordonklaus/portaudio
go get github.com/AssemblyAI/assemblyai-go-sdk
If the execution is successful, you should end up with the following directory structure:
jarvis
├── go.mod
└── go.sum
Next up, you'll need an AssemblyAI API key.
Create an AssemblyAI account
To get an API key, you first need to create an AssemblyAI account. Go to the sign-up page and fill out the form to get started.
Once your account is created, you need to set up billing details. Streaming Speech-to-Text is among a few selected APIs that aren't available on the free plan. You can set up billing by going to Billing and providing valid credit card details.
Next, go to the dashboard and take note of your API key, as you'll need it later when you run the application.
Record the raw audio data
With the dependencies set up and the API key in hand, you're ready to start implementing the core logic of your personal AI assistant. The first step is to figure out how to get raw audio data from the microphone. As mentioned earlier, you'll be using the Go bindings of the well-known PortAudio I/O library, which makes it easy to get raw data and manipulate low-level options like the sampling rate and the number of frames per buffer. This is important, as AssemblyAI is sensitive to these options and might generate an inaccurate transcript if they aren't set correctly.
Create a new recorder.go file in the jarvis directory, import the dependencies, and define a new recorder struct:
package main

import (
	"bytes"
	"encoding/binary"

	"github.com/gordonklaus/portaudio"
)

type recorder struct {
	stream *portaudio.Stream
	in     []int16
}
The recorder struct holds a reference to the input stream. The in field is the buffer that PortAudio fills with samples each time you read from that stream.
Use the following code to define a newRecorder function that creates and initializes a new recorder struct:
func newRecorder(sampleRate int, framesPerBuffer int) (*recorder, error) {
	in := make([]int16, framesPerBuffer)
	stream, err := portaudio.OpenDefaultStream(1, 0, float64(sampleRate), framesPerBuffer, in)
	if err != nil {
		return nil, err
	}
	return &recorder{
		stream: stream,
		in:     in,
	}, nil
}
This function takes in the required sample rate and frames per buffer to configure PortAudio, opens the default audio input device attached to your computer, and returns a pointer to a new recorder struct. You might want to use OpenStream instead of OpenDefaultStream if you have multiple mics connected to your computer and want to use a specific one.
Next, define a few methods on the recorder struct pointer that you'll use in the next step:
func (r *recorder) Read() ([]byte, error) {
	if err := r.stream.Read(); err != nil {
		return nil, err
	}
	buf := new(bytes.Buffer)
	if err := binary.Write(buf, binary.LittleEndian, r.in); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func (r *recorder) Start() error {
	return r.stream.Start()
}

func (r *recorder) Stop() error {
	return r.stream.Stop()
}

func (r *recorder) Close() error {
	return r.stream.Close()
}
The Read method reads data from the input stream, writes the samples to a buffer as little-endian bytes, and then returns the contents of that buffer.
The Start, Stop, and Close methods call similarly named methods on the stream and don't do anything unique.
Create a real-time transcriber
AssemblyAI divides each session with the streaming API into multiple discrete events and requires an event handler for each of these events. These event handlers are defined as field functions on the RealTimeTranscriber struct type that is provided by AssemblyAI.
Here are the different field functions that the RealTimeTranscriber accepts:
transcriber := &assemblyai.RealTimeTranscriber{
	OnSessionBegins: func(event assemblyai.SessionBegins) {
		// ...
	},
	OnSessionTerminated: func(event assemblyai.SessionTerminated) {
		// ...
	},
	OnPartialTranscript: func(event assemblyai.PartialTranscript) {
		// ...
	},
	OnFinalTranscript: func(event assemblyai.FinalTranscript) {
		// ...
	},
	OnSessionInformation: func(event assemblyai.SessionInformation) {
		// ...
	},
	OnError: func(err error) {
		// ...
	},
}
You only need to provide implementations for the field functions you want to use. The others can be omitted if they aren't needed.
Create a new main.go file and add this transcriber struct:
package main

import (
	"fmt"

	"github.com/AssemblyAI/assemblyai-go-sdk"
)

var transcriber = &assemblyai.RealTimeTranscriber{
	OnSessionBegins: func(event assemblyai.SessionBegins) {
		fmt.Println("session begins")
	},
	OnSessionTerminated: func(event assemblyai.SessionTerminated) {
		fmt.Println("session terminated")
	},
	OnPartialTranscript: func(event assemblyai.PartialTranscript) {
		fmt.Printf("%s\r", event.Text)
	},
	OnFinalTranscript: func(event assemblyai.FinalTranscript) {
		fmt.Println(event.Text)
	},
	OnError: func(err error) {
		fmt.Println(err)
	},
}
This struct defines all of the field functions that you'll use in this tutorial.
The events fire as follows:
OnSessionBegins fires when the connection is established.
OnSessionTerminated fires when the connection is terminated.
OnPartialTranscript fires when AssemblyAI is transcribing a new sentence.
OnFinalTranscript fires when a sentence is completely transcribed.
OnPartialTranscript fires repeatedly until a sentence is complete, and each invocation contains the complete sentence up to that point. Only after the sentence is completely transcribed does OnFinalTranscript fire with the complete transcribed sentence.
The OnError function simply handles any errors that may occur during a session.
The benefit of using a carriage return (\r) in the OnPartialTranscript function is that each call overwrites the same line in the terminal. This way, it won't clutter your screen by printing each partial output on a new line.
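One side effect of overwriting a line this way is that, if a later partial transcript happens to be shorter than the previous one, the tail of the older text stays visible. The following sketch shows one possible way to handle that by padding with spaces. The overwriteLine helper and the sample partials are assumptions made for illustration; they aren't part of the AssemblyAI SDK.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// overwriteLine returns text padded with spaces so that it covers at least
// prevLen characters, followed by a carriage return that moves the cursor
// back to the start of the line. The name is hypothetical.
func overwriteLine(text string, prevLen int) string {
	pad := prevLen - utf8.RuneCountInString(text)
	if pad < 0 {
		pad = 0
	}
	return text + strings.Repeat(" ", pad) + "\r"
}

func main() {
	// Made-up partial transcripts; the last one is shorter than the one before it.
	partials := []string{"hey", "hey jarvis what", "hey jarvis"}
	prev := 0
	for _, p := range partials {
		fmt.Print(overwriteLine(p, prev))
		prev = utf8.RuneCountInString(p)
	}
	fmt.Println()
}
```

In the transcriber, the same idea would mean remembering the length of the last partial transcript between calls to OnPartialTranscript.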
To add hotword detection support, you need to define a hotword variable that is populated from a command line argument and compared against each final transcript in the OnFinalTranscript function:
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"os/signal"
	"strings"
	"syscall"

	"github.com/AssemblyAI/assemblyai-go-sdk"
	"github.com/gordonklaus/portaudio"
)

var hotword string

var transcriber = &assemblyai.RealTimeTranscriber{
	// truncated
	OnFinalTranscript: func(event assemblyai.FinalTranscript) {
		fmt.Println(event.Text)
		hotwordDetected := strings.Contains(
			strings.ToLower(event.Text),
			strings.ToLower(hotword),
		)
		if hotwordDetected {
			fmt.Println("I am here!")
		}
	},
	// truncated
}
So far, the code doesn't contain the logic for populating the hotword variable. That'll be done in the main function that you'll write next.
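Because strings.Contains matches substrings, a hotword also triggers when it appears inside a longer word. If that matters for your assistant, one alternative is to compare whole words. The sketch below assumes a single-word hotword, and the containsHotword name is hypothetical rather than something the tutorial's code defines.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// containsHotword reports whether hotword appears in text as a whole word,
// ignoring case and any punctuation around it.
// Assumption for this sketch: the hotword is a single word.
func containsHotword(text, hotword string) bool {
	// Split on every character that is not a letter or digit,
	// so "Jarvis," and "Jarvis?" both produce the word "jarvis".
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	target := strings.ToLower(hotword)
	for _, w := range words {
		if w == target {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(containsHotword("Hey, Jarvis! What's the weather?", "jarvis")) // prints: true
	fmt.Println(containsHotword("The jarvisson report is ready.", "Jarvis"))   // prints: false
}
```

Swapping this in would only change the hotwordDetected expression inside OnFinalTranscript; the rest of the handler stays the same.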
Stitch everything together
With all the required pieces in place, the only thing left to do is define a main function, invoke the AssemblyAI API, and pass in the raw audio data. Let's start the main function by setting up a logger:
func main() {
	logger := log.New(os.Stderr, "", log.Lshortfile)
}
This logger will output the logs to stderr.
Note: The rest of the code in this section will also be written in the body of the main function.
Next, you need to initialize PortAudio:
// Use PortAudio to record the microphone
portaudio.Initialize()
defer portaudio.Terminate()
This initializes PortAudio's internal data structures. It must happen before you call any other PortAudio API function, including the one that opens the input stream later on.
Let's populate the hotword variable next from a command line argument:
hotword = os.Args[1]
Optionally, you can also add a print statement after this to print the hotword to the screen:
fmt.Println(hotword)
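Indexing os.Args[1] directly panics when the program is started without an argument. A small guard along the following lines is one way to fail with a readable message instead. The hotwordFromArgs helper and its usage text are assumptions for this sketch, and the example feeds it made-up argument lists rather than the real os.Args.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// hotwordFromArgs returns the first argument after the program name,
// or an error if it is missing or blank. The name is hypothetical.
func hotwordFromArgs(args []string) (string, error) {
	if len(args) < 2 || strings.TrimSpace(args[1]) == "" {
		return "", errors.New("usage: jarvis <hotword>")
	}
	return strings.TrimSpace(args[1]), nil
}

func main() {
	// Made-up argument lists standing in for os.Args.
	for _, args := range [][]string{{"jarvis", "Jarvis"}, {"jarvis"}} {
		hotword, err := hotwordFromArgs(args)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("hotword:", hotword)
	}
}
```

In the tutorial's main function, you would pass os.Args to the helper and call logger.Fatal if it returns an error.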
Now, you need to set up a few variables for the AssemblyAI API key, input sample rate, and input frames per buffer:
device, err := portaudio.DefaultInputDevice()
if err != nil {
	logger.Fatal(err)
}

var (
	apiKey = os.Getenv("ASSEMBLYAI_API_KEY")

	// Number of samples per second
	sampleRate = device.DefaultSampleRate

	// Number of samples to send at once
	framesPerBuffer = int(0.2 * sampleRate) // 200 ms of audio
)
This code takes the API key from an environment variable and sets up the sampleRate and framesPerBuffer variables by letting PortAudio supply the configured sample rate of the default input device. This way, you don't have to manually check what the sample rate of the input device is, and it'll automatically be set correctly.
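To get a feel for the numbers involved, the following sketch works out how many frames and bytes a 200 ms chunk contains at a few common sample rates. The framesFor helper and the list of rates are assumptions for illustration; your device may report a different rate.

```go
package main

import "fmt"

// framesFor returns how many frames cover the given number of milliseconds
// at the given sample rate. The name is hypothetical.
func framesFor(sampleRate float64, millis int) int {
	return int(sampleRate * float64(millis) / 1000)
}

func main() {
	// Common sample rates, used here only as examples.
	for _, rate := range []float64{16000, 44100, 48000} {
		frames := framesFor(rate, 200)
		// One channel of 16-bit samples means two bytes per frame.
		fmt.Printf("%.0f Hz: %d frames, %d bytes per chunk\n", rate, frames, frames*2)
	}
}
```

At 44,100 Hz, for example, each 200 ms chunk holds 8,820 frames, which the recorder turns into 17,640 bytes.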
It's finally time to create an AssemblyAI API client, a new recorder, and send data from the recorder to the API. Use the following code:
client := assemblyai.NewRealTimeClientWithOptions(
	assemblyai.WithRealTimeAPIKey(apiKey),
	assemblyai.WithRealTimeSampleRate(int(sampleRate)),
	assemblyai.WithRealTimeTranscriber(transcriber),
)

ctx := context.Background()

if err := client.Connect(ctx); err != nil {
	logger.Fatal(err)
}

rec, err := newRecorder(int(sampleRate), framesPerBuffer)
if err != nil {
	logger.Fatal(err)
}

if err := rec.Start(); err != nil {
	logger.Fatal(err)
}
This code passes the transcriber to AssemblyAI while creating a new real-time client. It also passes the sampleRate of the microphone to AssemblyAI, as the streaming speech-to-text would fail without it. Once the client is created, it opens up a WebSocket connection to AssemblyAI via the client.Connect call. Next, it creates a new recorder using the newRecorder function and starts recording via the rec.Start() method. If any of these stages fails, the code logs the error and exits.
Now you need to add an infinite for loop that gets the data from the microphone and sends it to AssemblyAI:
for {
	b, err := rec.Read()
	if err != nil {
		logger.Fatal(err)
	}

	// Send partial audio samples
	if err := client.Send(ctx, b); err != nil {
		logger.Fatal(err)
	}
}
If you try running the code you've written so far, it should work. However, it lacks proper resource cleanup. Let's make sure the code catches any termination signals and cleans up the resources appropriately. There are two changes you need to make for this to work. First, you need to create a new channel that'll be notified in case of a SIGINT or SIGTERM signal. Put this code at the top of your main function:
sigs := make(chan os.Signal, 1)
signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
Next, you need to modify your for loop to add a select statement and make sure you clean up the resources in case you receive something in the sigs channel:
for {
	select {
	case <-sigs:
		fmt.Println("stopping recording...")
		if err := rec.Stop(); err != nil {
			logger.Fatal(err)
		}
		if err := client.Disconnect(ctx, true); err != nil {
			logger.Fatal(err)
		}
		os.Exit(0)
	default:
		b, err := rec.Read()
		if err != nil {
			logger.Fatal(err)
		}

		// Send partial audio samples
		if err := client.Send(ctx, b); err != nil {
			logger.Fatal(err)
		}
	}
}
Due to the introduction of the select statement, the code checks on every iteration whether there is anything in the sigs channel. If so, it runs the cleanup tasks. Otherwise, it continues reading data from the microphone and passing it to AssemblyAI.
As part of the cleanup, the code stops the recording using the Stop() method and closes the WebSocket connection to AssemblyAI via the client.Disconnect() method.
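The select-with-default pattern can be tried in isolation, without a microphone or a network connection. In the sketch below, a buffered channel stands in for sigs and a plain function stands in for the read-and-send step; the pump name and both stand-ins are assumptions made for illustration.

```go
package main

import "fmt"

// pump calls read repeatedly until a value arrives on stop,
// then returns how many reads completed. The name is hypothetical.
func pump(stop <-chan struct{}, read func()) int {
	n := 0
	for {
		select {
		case <-stop:
			// A stop request is only noticed between two reads.
			return n
		default:
			read()
			n++
		}
	}
}

func main() {
	stop := make(chan struct{}, 1)
	reads := 0
	n := pump(stop, func() {
		reads++
		if reads == 3 {
			stop <- struct{}{} // simulate a termination signal during the third read
		}
	})
	fmt.Println("reads before stop:", n) // prints: reads before stop: 3
}
```

Because the channel is checked only between reads, a signal that arrives while a read is in progress is handled once that read returns. In the tutorial's loop, that likely means a delay of up to one buffer of audio before cleanup starts.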
Review the complete code
By now, you should have a main.go file and a recorder.go file. The recorder.go file should resemble this:
package main

import (
	"bytes"
	"encoding/binary"

	"github.com/gordonklaus/portaudio"
)

type recorder struct {
	stream *portaudio.Stream
	in     []int16
}

func newRecorder(sampleRate int, framesPerBuffer int) (*recorder, error) {
	in := make([]int16, framesPerBuffer)
	stream, err := portaudio.OpenDefaultStream(1, 0, float64(sampleRate), framesPerBuffer, in)
	if err != nil {
		return nil, err
	}
	return &recorder{
		stream: stream,
		in:     in,
	}, nil
}

func (r *recorder) Read() ([]byte, error) {
	if err := r.stream.Read(); err != nil {
		return nil, err
	}
	buf := new(bytes.Buffer)
	if err := binary.Write(buf, binary.LittleEndian, r.in); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func (r *recorder) Start() error {
	return r.stream.Start()
}

func (r *recorder) Stop() error {
	return r.stream.Stop()
}

func (r *recorder) Close() error {
	return r.stream.Close()
}
And the main.go file should resemble this:
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"os/signal"
	"strings"
	"syscall"

	"github.com/AssemblyAI/assemblyai-go-sdk"
	"github.com/gordonklaus/portaudio"
)

var hotword string

var transcriber = &assemblyai.RealTimeTranscriber{
	OnSessionBegins: func(event assemblyai.SessionBegins) {
		fmt.Println("session begins")
	},
	OnSessionTerminated: func(event assemblyai.SessionTerminated) {
		fmt.Println("session terminated")
	},
	OnPartialTranscript: func(event assemblyai.PartialTranscript) {
		fmt.Printf("%s\r", event.Text)
	},
	OnFinalTranscript: func(event assemblyai.FinalTranscript) {
		fmt.Println(event.Text)
		hotwordDetected := strings.Contains(
			strings.ToLower(event.Text),
			strings.ToLower(hotword),
		)
		if hotwordDetected {
			fmt.Println("I am here!")
		}
	},
	OnError: func(err error) {
		fmt.Println(err)
	},
}

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

	logger := log.New(os.Stderr, "", log.Lshortfile)

	// Use PortAudio to record the microphone
	portaudio.Initialize()
	defer portaudio.Terminate()

	hotword = os.Args[1]

	device, err := portaudio.DefaultInputDevice()
	if err != nil {
		logger.Fatal(err)
	}

	var (
		apiKey = os.Getenv("ASSEMBLYAI_API_KEY")

		// Number of samples per second
		sampleRate = device.DefaultSampleRate

		// Number of samples to send at once
		framesPerBuffer = int(0.2 * sampleRate) // 200 ms of audio
	)

	client := assemblyai.NewRealTimeClientWithOptions(
		assemblyai.WithRealTimeAPIKey(apiKey),
		assemblyai.WithRealTimeSampleRate(int(sampleRate)),
		assemblyai.WithRealTimeTranscriber(transcriber),
	)

	ctx := context.Background()

	if err := client.Connect(ctx); err != nil {
		logger.Fatal(err)
	}

	rec, err := newRecorder(int(sampleRate), framesPerBuffer)
	if err != nil {
		logger.Fatal(err)
	}

	if err := rec.Start(); err != nil {
		logger.Fatal(err)
	}

	for {
		select {
		case <-sigs:
			fmt.Println("stopping recording...")
			if err := rec.Stop(); err != nil {
				logger.Fatal(err)
			}
			if err := client.Disconnect(ctx, true); err != nil {
				logger.Fatal(err)
			}
			os.Exit(0)
		default:
			b, err := rec.Read()
			if err != nil {
				logger.Fatal(err)
			}

			// Send partial audio samples
			if err := client.Send(ctx, b); err != nil {
				logger.Fatal(err)
			}
		}
	}
}
Run the application
Let's see the results of your work in action by running the code. Open up the terminal, cd into the project directory, and set up your AssemblyAI API key as an environment variable:
export ASSEMBLYAI_API_KEY='***'
Note: Replace *** in the command above with your AssemblyAI API key.
Finally, run this command:
go run . Jarvis
This will set Jarvis as the hotword, and the code will output I am here! whenever it sees Jarvis in the output from AssemblyAI.
Conclusion
In this tutorial, you learned how to create an application that detects a hotword using the AssemblyAI API. You saw how PortAudio makes it easy to get raw data from a microphone and how AssemblyAI allows you to make sense of that data and transcribe it.
AssemblyAI's Streaming Speech-to-Text API opens up a world of possibilities for developers seeking to enhance their applications with cutting-edge AI.
Whether it's transcribing phone calls for customer service analytics, generating subtitles for video content, or creating accessibility solutions for the hearing impaired, AssemblyAI's powerful Speech AI models provide a versatile toolkit for developers to innovate and improve user experiences with the power of voice. With its high accuracy, ease of integration, and streaming capabilities, AssemblyAI empowers developers to unlock the full potential of voice data in their applications.