Transcribe streaming audio from a microphone in C#
Learn how to transcribe streaming audio in C#.
Overview
By the end of this tutorial, you’ll be able to transcribe audio from your microphone in C#.
Supported languages
Streaming Speech-to-Text is only available for English.
Before you begin
To complete this tutorial, you need:
- .NET 9 (earlier versions also work with minor adjustments)
- An AssemblyAI account with a credit card set up.
Here’s the full sample code for what you’ll build in this tutorial:
Step 1: Setting up your project
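The tutorial doesn't prescribe exact commands for this step, but a minimal setup might look like the following. The project name `TranscribeStreaming` is just an example, and you'll also need SoX installed for the microphone capture in Step 3:

```shell
# Create a new .NET console project (any recent SDK works with minor adjustments)
dotnet new console -o TranscribeStreaming
cd TranscribeStreaming

# Install SoX for microphone capture (pick the command for your OS)
# macOS:  brew install sox
# Debian/Ubuntu:  sudo apt-get install sox
```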
Step 2: Define the program structure and assign your API key
Browse to Account, and then click the text under Your API key to copy it.
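A minimal program skeleton might look like the sketch below. The exact structure (a `Program` class holding the API key as a constant, with the streaming methods filled in during later steps) is an assumption based on how this tutorial is organized; replace `<YOUR_API_KEY>` with the key you copied:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    // Paste the API key you copied from your account page here.
    private const string ApiKey = "<YOUR_API_KEY>";

    static async Task Main()
    {
        Console.WriteLine("Starting streaming transcription. Press Ctrl+C to stop.");

        // Cancel the whole pipeline with Ctrl+C.
        using var cts = new CancellationTokenSource();
        Console.CancelKeyPress += (_, e) => { e.Cancel = true; cts.Cancel(); };

        // Implemented in Step 4.
        await ConnectAndTranscribe(cts.Token);
    }

    // Placeholder; replaced by the real implementation in Step 4.
    static Task ConnectAndTranscribe(CancellationToken ct) => Task.CompletedTask;
}
```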
Step 3: Record audio from the microphone
In this step, you’ll set up microphone recording using SoX, a cross-platform command-line audio utility.
Create a CaptureAndSendAudioAsync method to handle the microphone recording and audio streaming:
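A sketch of this method: it starts SoX as a child process, reads raw PCM audio from its standard output, and forwards each chunk over the already-open WebSocket. The SoX arguments match the audio format described below; whether the service expects raw binary frames or a JSON wrapper depends on the API version, so treat the send call as an assumption:

```csharp
using System;
using System.Diagnostics;
using System.Net.WebSockets;
using System.Threading;
using System.Threading.Tasks;

static async Task CaptureAndSendAudioAsync(ClientWebSocket ws, CancellationToken ct)
{
    // -d reads from the default input device; the remaining arguments
    // output raw 16-bit signed-integer PCM, 16 kHz, mono to stdout ("-").
    var sox = Process.Start(new ProcessStartInfo
    {
        FileName = "sox",
        Arguments = "-d -t raw -b 16 -e signed-integer -r 16000 -c 1 -",
        RedirectStandardOutput = true,
        UseShellExecute = false,
    })!;

    var buffer = new byte[4096];
    int read;
    while (!ct.IsCancellationRequested &&
           (read = await sox.StandardOutput.BaseStream
               .ReadAsync(buffer, 0, buffer.Length, ct)) > 0)
    {
        // Forward each audio chunk to the streaming service.
        await ws.SendAsync(new ArraySegment<byte>(buffer, 0, read),
            WebSocketMessageType.Binary, endOfMessage: true, ct);
    }
    sox.Kill();
}
```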
Audio data format
The SoX arguments configure the format of the audio output. The arguments configure the format to a single channel with 16-bit signed integer PCM encoding and 16 kHz sample rate.
If you want to stream data from elsewhere, make sure that your audio data is in the following format:
- Single channel
- 16-bit signed integer PCM or mu-law encoding
By default, the Streaming STT service expects PCM16-encoded audio. If you want to use mu-law encoding, see Specifying the encoding.
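To verify locally that SoX produces audio in this format, you can run it directly from a terminal. This invocation is only an illustration, not part of the program:

```shell
# Record 3 seconds of raw audio: mono, 16-bit signed-integer PCM, 16 kHz
sox -d -t raw -b 16 -e signed-integer -r 16000 -c 1 test.raw trim 0 3
```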
Step 4: Set up the WebSocket connection to the Streaming service
Streaming Speech-to-Text uses WebSockets to stream audio to AssemblyAI. This requires first establishing a connection to the API.
Create a ConnectAndTranscribe method to handle the connection and transcription:
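A sketch of the connection logic. The endpoint URL, query parameter, and authorization header shown here are assumptions based on common WebSocket-API conventions; check the Streaming API reference for the current values. `CaptureAndSendAudioAsync` (Step 3) and `ProcessMessage` (defined next) are this tutorial's other methods:

```csharp
using System;
using System.Net.WebSockets;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

static async Task ConnectAndTranscribe(CancellationToken ct)
{
    // From Step 2.
    const string ApiKey = "<YOUR_API_KEY>";
    const int SampleRate = 16_000;

    using var ws = new ClientWebSocket();
    // Authenticate by sending the API key in a request header.
    ws.Options.SetRequestHeader("Authorization", ApiKey);

    // Endpoint URL and query parameters are assumptions; consult the
    // Streaming API reference for the current values.
    await ws.ConnectAsync(
        new Uri($"wss://streaming.assemblyai.com/v3/ws?sample_rate={SampleRate}"), ct);

    // Stream microphone audio (Step 3) while receiving transcripts.
    var sendTask = CaptureAndSendAudioAsync(ws, ct);

    var buffer = new byte[8192];
    while (ws.State == WebSocketState.Open && !ct.IsCancellationRequested)
    {
        var result = await ws.ReceiveAsync(new ArraySegment<byte>(buffer), ct);
        if (result.MessageType == WebSocketMessageType.Close) break;
        // For simplicity this sketch assumes each message fits in one frame.
        ProcessMessage(Encoding.UTF8.GetString(buffer, 0, result.Count));
    }
    await sendTask;
}
```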
Sample rate
The sampleRate is the number of audio samples per second, measured in hertz (Hz). Higher sample rates produce higher-quality audio, which may lead to better transcripts, but also mean more data being sent over the network.
We recommend the following sample rates:
- Minimum quality: 8_000 (8 kHz)
- Medium quality: 16_000 (16 kHz)
- Maximum quality: 48_000 (48 kHz)
If you don’t set a sample rate on the real-time transcriber, it defaults to 16 kHz.
Create a ProcessMessage method to handle the message processing:
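A sketch of the message handling using System.Text.Json. The field names (`type`, `transcript`) are assumptions about the message schema, which varies between API versions; check the Streaming API reference for the exact shape:

```csharp
using System;
using System.Text.Json;

static void ProcessMessage(string json)
{
    using var doc = JsonDocument.Parse(json);
    var root = doc.RootElement;

    // Field names are assumptions; check the Streaming API reference.
    if (!root.TryGetProperty("type", out var typeProp)) return;
    var type = typeProp.GetString();

    var text = root.TryGetProperty("transcript", out var t)
        ? t.GetString()
        : "";

    if (!string.IsNullOrEmpty(text))
    {
        Console.WriteLine($"{type}: {text}");
    }
}
```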
The real-time transcriber returns two types of transcripts: partial and final.
- Partial transcripts are returned as the audio is being streamed to AssemblyAI.
- Final transcripts are returned when the service detects a pause in speech.
End of utterance controls
You can configure the silence threshold for automatic utterance detection and programmatically force the end of an utterance to immediately get a Final transcript.
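Forcing the end of an utterance might look like the following hypothetical helper. The message name ("ForceEndpoint") and shape are assumptions; check the Streaming API reference for the exact format and for the silence-threshold parameters:

```csharp
using System;
using System.Net.WebSockets;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch: ask the service to end the current utterance
// immediately so it returns a final transcript without waiting for
// a pause in speech.
static async Task ForceEndOfUtteranceAsync(ClientWebSocket ws, CancellationToken ct)
{
    var msg = Encoding.UTF8.GetBytes("{\"type\":\"ForceEndpoint\"}");
    await ws.SendAsync(new ArraySegment<byte>(msg),
        WebSocketMessageType.Text, endOfMessage: true, ct);
}
```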
Step 5: Disconnect the streaming service
In this step, you’ll set up the disconnect logic.
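A sketch of the disconnect logic: notify the service that the session is over, then close the WebSocket cleanly. The termination message shape is an assumption; check the Streaming API reference:

```csharp
using System;
using System.Net.WebSockets;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

static async Task DisconnectAsync(ClientWebSocket ws)
{
    // Tell the service the session is finished (message shape is an
    // assumption), then perform a normal WebSocket close handshake.
    var terminate = Encoding.UTF8.GetBytes("{\"type\":\"Terminate\"}");
    await ws.SendAsync(new ArraySegment<byte>(terminate),
        WebSocketMessageType.Text, endOfMessage: true, CancellationToken.None);

    await ws.CloseAsync(WebSocketCloseStatus.NormalClosure,
        "Session finished", CancellationToken.None);
}
```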
To run the program, first run dotnet build, and then dotnet run.
Next steps
To learn more about Streaming Speech-to-Text, see the following resources:
Need some help?
If you get stuck, or have any other questions, we’d love to help you out. Contact our support team at support@assemblyai.com or create a support ticket.