How to Do Hotword Detection with Streaming Speech-to-Text and Go

If you've ever wanted to create a personal AI assistant like Siri or Alexa, the first step is to figure out how to trigger the AI using a specific word or phrase (also known as a hotword). All of the prevalent AI systems use a similar approach; for Alexa, the hotword is "Alexa," and for Siri, the hotword is "Hey Siri."

In this tutorial, you'll learn how to implement a hotword detection system using AssemblyAI's Streaming Speech-to-Text API. In homage to Iron Man, the assistant in the example is called Jarvis, but you're welcome to name it whatever you want. This tutorial uses the AssemblyAI Go SDK, but if Go isn't your preferred language, you're welcome to use any other supported programming language.

Before you start

To complete this tutorial, you'll need:

Go installed
PortAudio installed

Set up your environment

You'll use the Go bindings of PortAudio to get raw audio data from your microphone and the AssemblyAI Go SDK to interface with AssemblyAI.

Let's start by creating a new Go project and setting up all of the required dependencies. To do so, navigate to your preferred directory and run the following commands:

mkdir jarvis
cd jarvis
go mod init jarvis
go get github.com/gordonklaus/portaudio
go get github.com/AssemblyAI/assemblyai-go-sdk

If the execution is successful, you should end up with the following directory structure:

jarvis
├── go.mod
└── go.sum

Next up, you'll need an AssemblyAI API key.

Create an AssemblyAI account

To get an API key, you first need to create an AssemblyAI account. Go to the sign-up page and fill out the form to get started.

Once your account is created, you need to set up billing details. Streaming Speech-to-Text is among a few selected APIs that aren't available on the free plan. You can set up billing by going to Billing and providing valid credit card details.

Next, go to the dashboard and take note of your API key. You'll need it when you run the application.

Record the raw audio data

With the dependencies set up and the API key in hand, you're ready to start implementing the core logic of your personal AI assistant. The first step is to figure out how to get raw audio data from the microphone. As mentioned earlier, you'll be using the Go bindings of the well-known PortAudio I/O library, which makes it easy to get raw data and manipulate low-level options like the sampling rate and the number of frames per buffer. This is important, as AssemblyAI is sensitive to these options and might generate an inaccurate transcript if they aren't set correctly.

Create a new recorder.go file in the jarvis directory, import the dependencies, and define a new recorder struct:

package main

import (
    "bytes"
    "encoding/binary"

    "github.com/gordonklaus/portaudio"
)

type recorder struct {
    stream *portaudio.Stream
    in     []int16
}

The recorder struct holds a reference to the input stream along with the in buffer, which PortAudio fills with samples each time the stream is read.

Use the following code to configure a newRecorder function to create and initialize a new recorder struct:

func newRecorder(sampleRate int, framesPerBuffer int) (*recorder, error) {
    in := make([]int16, framesPerBuffer)

    stream, err := portaudio.OpenDefaultStream(1, 0, float64(sampleRate), framesPerBuffer, in)
    if err != nil {
        return nil, err
    }

    return &recorder{
        stream: stream,
        in:     in,
    }, nil
}

This function takes in the required sample rate and frames per buffer to configure PortAudio, opens the default audio input device attached to your computer, and returns a pointer to a new recorder struct. You might want to use OpenStream instead of OpenDefaultStream if you have multiple mics connected to your computer and want to use a specific one.

Next, define a few methods on the recorder struct pointer that you'll use in the next step:

func (r *recorder) Read() ([]byte, error) {
    if err := r.stream.Read(); err != nil {
        return nil, err
    }

    buf := new(bytes.Buffer)

    if err := binary.Write(buf, binary.LittleEndian, r.in); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

func (r *recorder) Start() error {
    return r.stream.Start()
}

func (r *recorder) Stop() error {
    return r.stream.Stop()
}

func (r *recorder) Close() error {
    return r.stream.Close()
}

The Read method reads data from the input stream, writes it to a buffer, and then returns that buffer.

The Start, Stop, and Close methods call similarly named methods on the stream and don't do anything unique.

Create a real-time transcriber

AssemblyAI divides each session with the streaming API into multiple discrete events and requires an event handler for each of these events. These event handlers are defined as field functions on the RealTimeTranscriber struct type that is provided by AssemblyAI.

Here are the different field functions that the RealTimeTranscriber accepts:

transcriber := &assemblyai.RealTimeTranscriber{
    OnSessionBegins: func(event assemblyai.SessionBegins) {
        // ...
    },
    OnSessionTerminated: func(event assemblyai.SessionTerminated) {
        // ...
    },
    OnPartialTranscript: func(event assemblyai.PartialTranscript) {
        // ...
    },
    OnFinalTranscript: func(event assemblyai.FinalTranscript) {
        // ...
    },
    OnSessionInformation: func(event assemblyai.SessionInformation) {
        // ...
    },
    OnError: func(err error) {
        // ...
    },
}

You only need to provide implementations for the field functions you want to use. The others can be omitted if they aren't needed.

Create a new main.go file and add this transcriber struct:

package main

import (
    "fmt"

    "github.com/AssemblyAI/assemblyai-go-sdk"
)

var transcriber = &assemblyai.RealTimeTranscriber{
    OnSessionBegins: func(event assemblyai.SessionBegins) {
        fmt.Println("session begins")
    },

    OnSessionTerminated: func(event assemblyai.SessionTerminated) {
        fmt.Println("session terminated")
    },

    OnPartialTranscript: func(event assemblyai.PartialTranscript) {
        fmt.Printf("%s\r", event.Text)
    },

    OnFinalTranscript: func(event assemblyai.FinalTranscript) {
        fmt.Println(event.Text)
    },

    OnError: func(err error) {
        fmt.Println(err)
    },
}

This struct defines all of the event handlers that you'll use in this tutorial.

The events fire as follows:

OnSessionBegins when the connection is established
OnSessionTerminated when the connection is terminated
OnPartialTranscript while AssemblyAI is transcribing a sentence
OnFinalTranscript when a sentence is completely transcribed

OnPartialTranscript fires repeatedly until a sentence is complete, and each invocation contains the sentence as transcribed up to that point. Only after the sentence is completely transcribed does OnFinalTranscript fire with the complete sentence.

The OnError function simply handles any errors that may occur during a session.

The benefit of using a carriage return (\r) in the OnPartialTranscript function is that it'll overwrite the same line in the terminal whenever the OnPartialTranscript function is called. This way, it won't clutter your screen by printing each partial output on a new line.
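
One caveat is that a carriage return moves the cursor back to the start of the line but doesn't erase it. If a later partial happens to be shorter than an earlier one, the tail of the longer text stays visible. The following is a minimal sketch of one way around that, assuming a hypothetical renderPartial helper that pads each line to the widest text printed so far. It isn't part of the tutorial code.

```go
package main

import "fmt"

// renderPartial returns text prefixed with a carriage return and padded
// with spaces to width, so a shorter line fully covers a longer one.
func renderPartial(text string, width int) string {
	return fmt.Sprintf("\r%-*s", width, text)
}

func main() {
	// Simulated partial transcripts; the last one is shorter than the second.
	partials := []string{"hey", "hey jarvis what", "hey jarvis"}

	width := 0
	for _, p := range partials {
		if len(p) > width {
			width = len(p)
		}
		fmt.Print(renderPartial(p, width))
	}
	fmt.Println()
}
```

Without the padding, the terminal would end up showing "hey jarvis what" even though the last partial was "hey jarvis".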

To add hotword detection support, define a hotword variable that'll be populated from a command line argument, and compare it against the transcript in the OnFinalTranscript function. The import list below also includes the packages that the main function will need later:

package main

import (
    "context"
    "fmt"
    "log"
    "os"
    "os/signal"
    "strings"
    "syscall"

    "github.com/AssemblyAI/assemblyai-go-sdk"
    "github.com/gordonklaus/portaudio"
)

var hotword string

var transcriber = &assemblyai.RealTimeTranscriber{
    // truncated

    OnFinalTranscript: func(event assemblyai.FinalTranscript) {
        fmt.Println(event.Text)

        hotwordDetected := strings.Contains(
            strings.ToLower(event.Text),
            strings.ToLower(hotword),
        )

        if hotwordDetected {
            fmt.Println("I am here!")
        }
    },

    // truncated
}

So far, the code doesn't contain the logic for populating the hotword variable. That'll be done in the main function that you'll write next.
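
Note that strings.Contains matches substrings, so a hotword like "art" would also trigger on "start". If that matters for your hotword, a stricter check can compare whole words instead. The following is a minimal sketch under that assumption; tokenize and containsHotword are hypothetical helpers and aren't part of the tutorial code.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize lowercases s and splits it into words, dropping punctuation.
func tokenize(s string) []string {
	return strings.FieldsFunc(strings.ToLower(s), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// containsHotword reports whether the words of hotword appear in text
// as a consecutive run of whole words, ignoring case and punctuation.
func containsHotword(text, hotword string) bool {
	words, target := tokenize(text), tokenize(hotword)
	if len(target) == 0 {
		return false
	}
	for i := 0; i+len(target) <= len(words); i++ {
		match := true
		for j, w := range target {
			if words[i+j] != w {
				match = false
				break
			}
		}
		if match {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(containsHotword("Hey, Jarvis! What's the time?", "jarvis")) // true
	fmt.Println(containsHotword("The jarvises are here.", "Jarvis"))        // false
	fmt.Println(containsHotword("Okay, hey Siri.", "Hey Siri"))             // true
}
```

The trade-off is that whole-word matching misses transcripts where the hotword is fused with other text, so the simpler substring check may still be the better fit for a quick prototype.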

Stitch everything together

With all the required pieces in place, the only thing left to do is define a main function, invoke the AssemblyAI API, and pass in the raw audio data. Let's start the main function by setting up a logger:

func main() {
    logger := log.New(os.Stderr, "", log.Lshortfile)
}

This logger will output the logs to stderr.

Note: The rest of the code in this section will also be written in the body of the main function.

Next, you need to initialize PortAudio:

// Use PortAudio to record the microphone
if err := portaudio.Initialize(); err != nil {
    logger.Fatal(err)
}
defer portaudio.Terminate()

PortAudio must be initialized before you can call any of its other API functions. This call sets up its internal data structures so that you can open the input stream later on.

Let's populate the hotword variable next from a command line argument:

hotword = os.Args[1]

Optionally, you can also add a print statement after this to print the hotword to the screen:

fmt.Println(hotword)
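
Keep in mind that os.Args[1] panics with an index-out-of-range error if the program is started without an argument, and that an empty hotword would match every transcript because strings.Contains always returns true for an empty substring. The following is a minimal sketch of a guard, assuming a hypothetical parseHotword helper. In the real program you would pass it os.Args.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// parseHotword returns the first command line argument after the program
// name, or an error when it is missing or blank.
func parseHotword(args []string) (string, error) {
	if len(args) < 2 || strings.TrimSpace(args[1]) == "" {
		return "", errors.New("usage: go run . <hotword>")
	}
	return strings.TrimSpace(args[1]), nil
}

func main() {
	// Simulated argument lists; a real program would pass os.Args.
	for _, args := range [][]string{{"jarvis", "Jarvis"}, {"jarvis"}} {
		hotword, err := parseHotword(args)
		fmt.Printf("%q %v\n", hotword, err)
	}
}
```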

Now, you need to set up a few variables for the AssemblyAI API key, input sample rate, and input frames per buffer:

device, err := portaudio.DefaultInputDevice()
if err != nil {
    logger.Fatal(err)
}

var (
    apiKey = os.Getenv("ASSEMBLYAI_API_KEY")

    // Number of samples per second
    sampleRate = device.DefaultSampleRate

    // Number of samples to send at once
    framesPerBuffer = int(0.2 * sampleRate) // 200 ms of audio
)

This code takes the API key from an environment variable and sets up the sampleRate and framesPerBuffer variables by letting PortAudio supply the configured sample rate of the default input device. This way, you don't have to manually check what the sample rate of the input device is, and it'll automatically be set correctly.

It's finally time to create an AssemblyAI API client, a new recorder, and send data from the recorder to the API. Use the following code:

client := assemblyai.NewRealTimeClientWithOptions(
    assemblyai.WithRealTimeAPIKey(apiKey),
    assemblyai.WithRealTimeSampleRate(int(sampleRate)),
    assemblyai.WithRealTimeTranscriber(transcriber),
)

ctx := context.Background()

if err := client.Connect(ctx); err != nil {
    logger.Fatal(err)
}

rec, err := newRecorder(int(sampleRate), framesPerBuffer)
if err != nil {
    logger.Fatal(err)
}

if err := rec.Start(); err != nil {
    logger.Fatal(err)
}

This code passes the transcriber to AssemblyAI while creating a new real-time client. It also passes the sampleRate of the microphone to AssemblyAI, as the streaming speech-to-text would fail without it. Once the client is created, it opens up a WebSocket connection to AssemblyAI via the client.Connect call. Next, it creates a new recorder using the newRecorder function and starts recording via the rec.Start() method. If anything fails in any of these stages, the code logs the error and exits.

Now you need to add an infinite for loop that gets the data from the microphone and sends it to AssemblyAI:

for {
    b, err := rec.Read()
    if err != nil {
        logger.Fatal(err)
    }

    // Send partial audio samples
    if err := client.Send(ctx, b); err != nil {
        logger.Fatal(err)
    }
}

If you run the code you've written so far, it should work. However, it lacks proper resource cleanup. Let's make sure the code catches any termination signals and cleans up the resources appropriately. There are two changes you need to make for this to work. First, you need to create a new channel that'll be notified in case of a SIGINT or SIGTERM signal. Put this code at the top of your main function:

sigs := make(chan os.Signal, 1)
signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

Next, you need to modify your for loop to add a select statement and make sure you clean up the resources in case you receive something in the sigs channel:

for {
    select {
    case <-sigs:
        fmt.Println("stopping recording...")
        if err := rec.Stop(); err != nil {
            logger.Fatal(err)
        }
        if err := client.Disconnect(ctx, true); err != nil {
            logger.Fatal(err)
        }
        // Return instead of calling os.Exit so the deferred portaudio.Terminate runs
        return
    default:
        b, err := rec.Read()
        if err != nil {
            logger.Fatal(err)
        }

        // Send partial audio samples
        if err := client.Send(ctx, b); err != nil {
            logger.Fatal(err)
        }
    }
}

With the select statement in place, each loop iteration first checks whether there is anything in the sigs channel. If so, it runs the cleanup tasks. Otherwise, it continues reading data from the microphone and passing it to AssemblyAI.

As part of the cleanup, the code stops the recording using the Stop() method and disconnects the WebSocket connection to AssemblyAI via the client.Disconnect() method.

Review the complete code

By now, you should have a main.go file and a recorder.go file. The recorder.go file should resemble this:

package main

import (
    "bytes"
    "encoding/binary"

    "github.com/gordonklaus/portaudio"
)

type recorder struct {
    stream *portaudio.Stream
    in     []int16
}

func newRecorder(sampleRate int, framesPerBuffer int) (*recorder, error) {
    in := make([]int16, framesPerBuffer)

    stream, err := portaudio.OpenDefaultStream(1, 0, float64(sampleRate), framesPerBuffer, in)
    if err != nil {
        return nil, err
    }

    return &recorder{
        stream: stream,
        in:     in,
    }, nil
}

func (r *recorder) Read() ([]byte, error) {
    if err := r.stream.Read(); err != nil {
        return nil, err
    }

    buf := new(bytes.Buffer)

    if err := binary.Write(buf, binary.LittleEndian, r.in); err != nil {
        return nil, err
    }

    return buf.Bytes(), nil
}

func (r *recorder) Start() error {
    return r.stream.Start()
}

func (r *recorder) Stop() error {
    return r.stream.Stop()
}

func (r *recorder) Close() error {
    return r.stream.Close()
}

And the main.go file should resemble this:

package main

import (
    "context"
    "fmt"
    "log"
    "os"
    "os/signal"
    "strings"
    "syscall"

    "github.com/AssemblyAI/assemblyai-go-sdk"
    "github.com/gordonklaus/portaudio"
)

var hotword string

var transcriber = &assemblyai.RealTimeTranscriber{
    OnSessionBegins: func(event assemblyai.SessionBegins) {
        fmt.Println("session begins")
    },

    OnSessionTerminated: func(event assemblyai.SessionTerminated) {
        fmt.Println("session terminated")
    },

    OnPartialTranscript: func(event assemblyai.PartialTranscript) {
        fmt.Printf("%s\r", event.Text)
    },

    OnFinalTranscript: func(event assemblyai.FinalTranscript) {
        fmt.Println(event.Text)
        hotwordDetected := strings.Contains(
            strings.ToLower(event.Text),
            strings.ToLower(hotword),
        )
        if hotwordDetected {
            fmt.Println("I am here!")
        }
    },

    OnError: func(err error) {
        fmt.Println(err)
    },
}

func main() {
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)

    logger := log.New(os.Stderr, "", log.Lshortfile)

    // Use PortAudio to record the microphone
    if err := portaudio.Initialize(); err != nil {
        logger.Fatal(err)
    }
    defer portaudio.Terminate()

    hotword = os.Args[1]

    device, err := portaudio.DefaultInputDevice()
    if err != nil {
        logger.Fatal(err)
    }

    var (
        apiKey = os.Getenv("ASSEMBLYAI_API_KEY")

        // Number of samples per second
        sampleRate = device.DefaultSampleRate

        // Number of samples to send at once
        framesPerBuffer = int(0.2 * sampleRate) // 200 ms of audio
    )

    client := assemblyai.NewRealTimeClientWithOptions(
        assemblyai.WithRealTimeAPIKey(apiKey),
        assemblyai.WithRealTimeSampleRate(int(sampleRate)),
        assemblyai.WithRealTimeTranscriber(transcriber),
    )

    ctx := context.Background()

    if err := client.Connect(ctx); err != nil {
        logger.Fatal(err)
    }

    rec, err := newRecorder(int(sampleRate), framesPerBuffer)
    if err != nil {
        logger.Fatal(err)
    }

    if err := rec.Start(); err != nil {
        logger.Fatal(err)
    }

    for {
        select {
        case <-sigs:
            fmt.Println("stopping recording...")
            if err := rec.Stop(); err != nil {
                logger.Fatal(err)
            }
            if err := client.Disconnect(ctx, true); err != nil {
                logger.Fatal(err)
            }
            // Return instead of calling os.Exit so the deferred portaudio.Terminate runs
            return
        default:
            b, err := rec.Read()
            if err != nil {
                logger.Fatal(err)
            }

            // Send partial audio samples
            if err := client.Send(ctx, b); err != nil {
                logger.Fatal(err)
            }
        }
    }
}

Run the application

Let's see the results of your work in action by running the code. Open up the terminal, cd into the project directory, and set up your AssemblyAI API key as an environment variable:

export ASSEMBLYAI_API_KEY='***'

Note: Replace *** in the command above with your AssemblyAI API key.

Finally, run this command:

go run . Jarvis

This will set Jarvis as the hotword, and the code will output I am here! whenever it sees Jarvis in a final transcript from AssemblyAI.

Conclusion

In this tutorial, you learned how to create an application that detects a hotword using the AssemblyAI API. You saw how PortAudio makes it easy to get raw data from a microphone and how AssemblyAI allows you to make sense of that data and transcribe it.

AssemblyAI's Streaming Speech-to-Text API opens up a world of possibilities for developers seeking to enhance their applications with cutting-edge AI.

Whether it's transcribing phone calls for customer service analytics, generating subtitles for video content, or creating accessibility solutions for the hearing impaired, AssemblyAI's powerful Speech AI models provide a versatile toolkit for developers to innovate and improve user experiences with the power of voice. With its high accuracy, ease of integration, and streaming capabilities, AssemblyAI empowers developers to unlock the full potential of voice data in their applications.