Transcribe streaming audio from a microphone in Ruby

Learn how to transcribe streaming audio in Ruby.

Overview

By the end of this tutorial, you’ll be able to transcribe audio from your microphone in Ruby.

Supported languages

Streaming Speech-to-Text is only available for English.

Before you begin

To complete this tutorial, you need:

  • An AssemblyAI account with an API key.
  • Ruby installed on your machine.

Here’s the full sample code for what you’ll build in this tutorial. Save it as main.rb:

require 'websocket-client-simple'
require 'json'
require 'base64'
require 'open3'

API_KEY = "<YOUR_API_KEY>"
SAMPLE_RATE = 16000
CHANNELS = 1
CHUNK_SIZE = 6400 # 200 ms of audio at 16 kHz = 3200 samples = 6400 bytes (16-bit = 2 bytes per sample)

$recording = false
$sox_process = nil # Open3 wait thread for the SoX process
$stdout_thread = nil
$exit_requested = false

ws = WebSocket::Client::Simple.connect(
  "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=#{SAMPLE_RATE}",
  headers: { 'Authorization' => API_KEY }
)

ws.on :open do
  on_open(ws)
end

ws.on :message do |message|
  on_message(ws, message.data)
end

ws.on :error do |error|
  on_error(ws, error)
end

ws.on :close do |code, reason|
  on_close(ws, code, reason)
end

def on_open(ws)
  begin
    command = "sox -d -t raw -b 16 -c #{CHANNELS} -r #{SAMPLE_RATE} -e signed-integer -L -"
    # Open3.popen3 returns [stdin, stdout, stderr, wait_thread]; keep the wait
    # thread so we can signal and reap the SoX process later.
    _stdin, stdout, _stderr, $sox_process = Open3.popen3(command)
    puts "Started audio recording with SoX"

    $recording = true

    $stdout_thread = Thread.new do
      while $recording && ws.open?
        begin
          audio_data = stdout.read_nonblock(CHUNK_SIZE)
          if audio_data && !audio_data.empty?
            audio_message = {
              "audio_data" => Base64.strict_encode64(audio_data)
            }.to_json
            ws.send(audio_message)
          end
        rescue IO::WaitReadable
          sleep 0.01
        rescue EOFError
          puts "Audio stream ended"
          break
        end
      end
    end
  rescue => e
    puts "Error starting audio recording: #{e}"
    puts e.backtrace
  end
end

def on_message(ws, message)
  begin
    msg = JSON.parse(message)
    msg_type = msg['message_type']

    if msg_type == 'SessionBegins'
      session_id = msg['session_id']
      puts "Session ID: #{session_id}"
      return
    end

    text = msg['text'] || ''
    return if text.empty?

    if msg_type == 'PartialTranscript'
      puts "Partial: #{text}"
    elsif msg_type == 'FinalTranscript'
      puts "Final: #{text}"
    elsif msg_type == 'error'
      puts "Error: #{msg['error'] || 'Unknown error'}"
    end
  rescue => e
    puts "Error handling message: #{e.message}"
    puts "Raw message: #{message.inspect}"
  end
end

def on_error(ws, error)
  puts "Error: #{error}"
end

def on_close(ws, code, reason)
  stop_recording
  puts 'Disconnected'
end

def stop_recording
  if $recording
    $recording = false

    if $sox_process
      begin
        # $sox_process is the Open3 wait thread; its pid is the SoX process.
        Process.kill('TERM', $sox_process.pid) rescue nil
        $sox_process.join(2)
        $stdout_thread.join(2) if $stdout_thread
        puts "Stopped audio recording"
      rescue => e
        puts "Error closing audio recording: #{e}"
      end
    end
  end
end

Signal.trap('INT') do
  puts
  puts 'Stopping recording'
  stop_recording
  puts 'Closing real-time transcript connection'
  $exit_requested = true
end

loop do
  if $exit_requested
    ws.close if ws.open?
    break
  end
  sleep 1
end

Step 1: Install dependencies

1

First, install SoX (Sound eXchange) to record audio from your microphone.

# (Mac)
brew install sox

# (Windows)
# You can download the SoX installer from the official website or use a package manager like Chocolatey:
choco install sox.portable

# (Linux)
apt install sox
2

Then install the required gem:

gem install websocket-client-simple
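If you manage dependencies with Bundler instead of installing gems globally, the equivalent Gemfile looks like this:

```ruby
# Gemfile
source 'https://rubygems.org'

# WebSocket client used to stream audio to the transcription service
gem 'websocket-client-simple'
```

Run bundle install, then start the script with bundle exec ruby main.rb.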

Step 2: Configure the API key

In this step, you’ll configure your API key to authenticate with AssemblyAI.

1

Browse to Account, and then click Copy API key under Your API key to copy it.

2

Store your API key in a variable. Replace <YOUR_API_KEY> with your copied API key.

API_KEY = "<YOUR_API_KEY>"

Step 3: Connect to the streaming service

In this step, you’ll create a WebSocket connection to the Streaming service and configure it to use your API key.

1

Configure the audio requirement variables:

SAMPLE_RATE = 16000
CHANNELS = 1
CHUNK_SIZE = 6400 # 200 ms of audio at 16 kHz = 3200 samples = 6400 bytes (16-bit = 2 bytes per sample)

$recording = false
$sox_process = nil
$stdout_thread = nil
$exit_requested = false
2

Create the websocket connection and assign your API_KEY:

ws = WebSocket::Client::Simple.connect(
  "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=#{SAMPLE_RATE}",
  headers: { 'Authorization' => API_KEY }
)
Sample rate

The sample_rate parameter is the number of audio samples per second, measured in hertz (Hz). Higher sample rates produce higher-quality audio, which may lead to better transcripts, but also mean more data is sent over the network.

We recommend the following sample rates:

  • Minimum quality: 8_000 (8 kHz)
  • Medium quality: 16_000 (16 kHz)
  • Maximum quality: 48_000 (48 kHz)

If you don’t set a sample rate on the real-time transcriber, it defaults to 16 kHz.
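The chunk size used in this tutorial follows directly from the sample rate: 16-bit mono PCM carries 2 bytes per sample, so a 200 ms chunk at 16 kHz is 6,400 bytes. A quick sanity check in plain Ruby (no external dependencies):

```ruby
BYTES_PER_SAMPLE = 2 # 16-bit signed PCM
CHANNELS = 1

# Bytes needed for a chunk of `ms` milliseconds at `sample_rate` Hz.
def chunk_bytes(sample_rate, ms)
  samples = sample_rate * ms / 1000
  samples * BYTES_PER_SAMPLE * CHANNELS
end

puts chunk_bytes(16_000, 200) # 6400, the CHUNK_SIZE used in this tutorial
puts chunk_bytes(8_000, 200)  # 3200 bytes per 200 ms at minimum quality
puts chunk_bytes(48_000, 200) # 19200 bytes per 200 ms at maximum quality
```

This also shows the bandwidth trade-off: tripling the sample rate triples the bytes you stream per chunk.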

3

Assign the event handlers, which we’ll create in the next step.

ws.on :open do
  on_open(ws)
end

ws.on :message do |message|
  on_message(ws, message.data)
end

ws.on :error do |error|
  on_error(ws, error)
end

ws.on :close do |code, reason|
  on_close(ws, code, reason)
end

Step 4: Create event handlers & record audio from microphone

1

In this step, we’ll create the on_open event handler, which contains our microphone recording logic. We’ll use SoX, a cross-platform audio tool, to record audio from the microphone.

def on_open(ws)
  begin
    command = "sox -d -t raw -b 16 -c #{CHANNELS} -r #{SAMPLE_RATE} -e signed-integer -L -"
    # Open3.popen3 returns [stdin, stdout, stderr, wait_thread]; keep the wait
    # thread so we can signal and reap the SoX process later.
    _stdin, stdout, _stderr, $sox_process = Open3.popen3(command)
    puts "Started audio recording with SoX"

    $recording = true

    $stdout_thread = Thread.new do
      while $recording && ws.open?
        begin
          audio_data = stdout.read_nonblock(CHUNK_SIZE)
          if audio_data && !audio_data.empty?
            # The service expects base64-encoded audio (require 'base64').
            audio_message = {
              "audio_data" => Base64.strict_encode64(audio_data)
            }.to_json
            ws.send(audio_message)
          end
        rescue IO::WaitReadable
          sleep 0.01
        rescue EOFError
          puts "Audio stream ended"
          break
        end
      end
    end
  rescue => e
    puts "Error starting audio recording: #{e}"
    puts e.backtrace
  end
end
Audio data format

The SoX arguments configure the format of the audio output: a single channel with 16-bit signed integer PCM encoding at a 16 kHz sample rate.

If you want to stream data from elsewhere, make sure that your audio data is in the following format:

  • Single channel
  • 16-bit signed integer PCM or mu-law encoding

By default, the Streaming STT service expects PCM16-encoded audio. If you want to use mu-law encoding, see Specifying the encoding.
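To see what “single-channel, 16-bit signed little-endian PCM” means in Ruby terms, here’s a small sketch that packs a few samples into the same raw byte layout SoX produces with the flags above (Array#pack’s 's<' directive is Ruby’s 16-bit signed little-endian format):

```ruby
# Three 16-bit signed samples: silence, a positive peak, a negative peak.
samples = [0, 1000, -1000]

# 's<*' = 16-bit signed integers, little-endian; this matches the byte layout
# produced by sox's "-b 16 -e signed-integer -L" flags.
pcm16 = samples.pack('s<*')

puts pcm16.bytesize   # 6 bytes: 2 bytes per sample
p pcm16.unpack('s<*') # round-trips back to [0, 1000, -1000]
```

If your audio comes from another source, confirming that it round-trips through unpack('s<*') like this is a quick way to check the encoding before streaming it.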

2

Create functions to handle close and error events from the real-time service.

def on_error(ws, error)
  puts "Error: #{error}"
end

def on_close(ws, code, reason)
  stop_recording
  puts 'Disconnected'
end
3

Create another function to handle transcripts. The real-time transcriber returns two types of transcripts: partial and final.

  • Partial transcripts are returned as the audio is being streamed to AssemblyAI.
  • Final transcripts are returned when the service detects a pause in speech.
def on_message(ws, message)
  begin
    msg = JSON.parse(message)
    msg_type = msg['message_type']

    if msg_type == 'SessionBegins'
      session_id = msg['session_id']
      puts "Session ID: #{session_id}"
      return
    end

    text = msg['text'] || ''
    return if text.empty?

    if msg_type == 'PartialTranscript'
      puts "Partial: #{text}"
    elsif msg_type == 'FinalTranscript'
      puts "Final: #{text}"
    elsif msg_type == 'error'
      puts "Error: #{msg['error'] || 'Unknown error'}"
    end
  rescue => e
    puts "Error handling message: #{e.message}"
    puts "Raw message: #{message.inspect}"
  end
end
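Because partial transcripts are revised as more audio arrives, only the final transcripts should be stitched together if you want a running transcript of the whole session. A minimal sketch, using the same message format the handler above parses:

```ruby
require 'json'

# Collects FinalTranscript texts into one session-wide transcript,
# ignoring partials (which are superseded by later messages).
class TranscriptCollector
  def initialize
    @finals = []
  end

  def handle(raw_message)
    msg = JSON.parse(raw_message)
    @finals << msg['text'] if msg['message_type'] == 'FinalTranscript'
  end

  def full_transcript
    @finals.join(' ')
  end
end

collector = TranscriptCollector.new
collector.handle({ 'message_type' => 'PartialTranscript', 'text' => 'hello wor' }.to_json)
collector.handle({ 'message_type' => 'FinalTranscript', 'text' => 'Hello world.' }.to_json)
collector.handle({ 'message_type' => 'FinalTranscript', 'text' => 'How are you?' }.to_json)

puts collector.full_transcript # "Hello world. How are you?"
```

To use it in the tutorial code, you would call collector.handle(message) from inside on_message.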
End of utterance controls

You can configure the silence threshold for automatic utterance detection and programmatically force the end of an utterance to immediately get a Final transcript.
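These controls are sent as JSON messages over the same WebSocket. The field names below (end_utterance_silence_threshold, force_end_utterance) are taken from the Streaming API reference, so check them against the current docs before relying on them; this sketch only builds the messages:

```ruby
require 'json'

# Raise the silence threshold for automatic utterance detection to 700 ms.
threshold_message = { 'end_utterance_silence_threshold' => 700 }.to_json

# Force the end of the current utterance to get a final transcript immediately.
force_message = { 'force_end_utterance' => true }.to_json

# With an open connection, these would be sent like the audio messages, e.g.:
# ws.send(threshold_message)
# ws.send(force_message)

puts threshold_message
puts force_message
```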

Step 5: Disconnect the streaming service

1

Create the stop_recording function to handle disconnections smoothly.

def stop_recording
  if $recording
    $recording = false

    if $sox_process
      begin
        # $sox_process is the Open3 wait thread; its pid is the SoX process.
        Process.kill('TERM', $sox_process.pid) rescue nil
        $sox_process.join(2)
        $stdout_thread.join(2) if $stdout_thread
        puts "Stopped audio recording"
      rescue => e
        puts "Error closing audio recording: #{e}"
      end
    end
  end
end
2

Also, add the following code to handle the INT signal (Ctrl+C), which stops the recording and disconnects the transcriber.

Signal.trap('INT') do
  puts
  puts 'Stopping recording'
  stop_recording
  puts 'Closing real-time transcript connection'
  $exit_requested = true
end

loop do
  if $exit_requested
    ws.close if ws.open?
    break
  end
  sleep 1
end

To run the program, use the command ruby main.rb.

Next steps

To learn more about Streaming Speech-to-Text, see the following resources:

Need some help?

If you get stuck, or have any other questions, we’d love to help you out. Contact our support team at support@assemblyai.com or create a support ticket.