Transcribe streaming audio from a microphone in Ruby

Learn how to transcribe streaming audio in Ruby.

Overview

By the end of this tutorial, you’ll be able to transcribe audio from your microphone in Ruby.

Supported languages

Streaming Speech-to-Text is only available for English.

Before you begin

To complete this tutorial, you need:

  • An AssemblyAI account with an API key.
  • Ruby installed on your machine.

Here’s the full sample code for what you’ll build in this tutorial. Save it as main.rb:

require 'websocket-client-simple'
require 'json'
require 'base64'
require 'open3'

API_KEY = "<YOUR_API_KEY>"
SAMPLE_RATE = 16000
CHANNELS = 1
CHUNK_SIZE = 6400 # 200 ms of audio at 16 kHz = 3200 samples = 6400 bytes (16-bit = 2 bytes per sample)

$recording = false
$sox_process = nil # Open3 wait thread for the SoX process
$stdout_thread = nil
$exit_requested = false

ws = WebSocket::Client::Simple.connect(
  "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=#{SAMPLE_RATE}",
  headers: { 'Authorization' => API_KEY }
)

ws.on :open do
  on_open(ws)
end

ws.on :message do |message|
  on_message(ws, message.data)
end

ws.on :error do |error|
  on_error(ws, error)
end

ws.on :close do |code, reason|
  on_close(ws, code, reason)
end

def on_open(ws)
  begin
    command = "sox -d -t raw -b 16 -c #{CHANNELS} -r #{SAMPLE_RATE} -e signed-integer -L -"
    # Open3.popen3 returns [stdin, stdout, stderr, wait_thread]; keep the wait
    # thread so we can signal and reap the SoX process later.
    _stdin, stdout, _stderr, $sox_process = Open3.popen3(command)
    puts "Started audio recording with SoX"

    $recording = true

    $stdout_thread = Thread.new do
      while $recording && ws.open?
        begin
          audio_data = stdout.read_nonblock(CHUNK_SIZE)
          if audio_data && !audio_data.empty?
            audio_message = {
              "audio_data" => Base64.strict_encode64(audio_data)
            }.to_json
            ws.send(audio_message)
          end
        rescue IO::WaitReadable
          sleep 0.01
        rescue EOFError
          puts "Audio stream ended"
          break
        end
      end
    end
  rescue => e
    puts "Error starting audio recording: #{e}"
    puts e.backtrace
  end
end

def on_message(ws, message)
  begin
    msg = JSON.parse(message)
    msg_type = msg['message_type']

    if msg_type == 'SessionBegins'
      session_id = msg['session_id']
      puts "Session ID: #{session_id}"
      return
    end

    text = msg['text'] || ''
    return if text.empty?

    if msg_type == 'PartialTranscript'
      puts "Partial: #{text}"
    elsif msg_type == 'FinalTranscript'
      puts "Final: #{text}"
    elsif msg_type == 'error'
      puts "Error: #{msg['error'] || 'Unknown error'}"
    end
  rescue => e
    puts "Error handling message: #{e.message}"
    puts "Raw message: #{message.inspect}"
  end
end

def on_error(ws, error)
  puts "Error: #{error}"
end

def on_close(ws, code, reason)
  stop_recording
  puts 'Disconnected'
end

def stop_recording
  if $recording
    $recording = false

    if $sox_process
      begin
        # $sox_process is the Open3 wait thread; its pid is the SoX process.
        Process.kill('TERM', $sox_process.pid) rescue nil
        $sox_process.join(2)
        $stdout_thread.join(2) if $stdout_thread
        puts "Stopped audio recording"
      rescue => e
        puts "Error closing audio recording: #{e}"
      end
    end
  end
end

Signal.trap('INT') do
  puts
  puts 'Stopping recording'
  stop_recording
  puts 'Closing real-time transcript connection'
  $exit_requested = true
end

loop do
  if $exit_requested
    ws.close if ws.open?
    break
  end
  sleep 1
end

Step 1: Install dependencies

1

First, install SoX (Sound eXchange) to record audio from your microphone.

# (Mac)
brew install sox

# (Windows)
# You can download the SoX installer from the official website or use a package manager like Chocolatey:
choco install sox.portable

# (Linux)
apt install sox
2

Then install the required gem:

gem install websocket-client-simple
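If you manage dependencies with Bundler instead of installing gems globally, the equivalent Gemfile looks like this:

```ruby
# Gemfile
source 'https://rubygems.org'

# WebSocket client used to stream audio to the transcription service
gem 'websocket-client-simple'
```

Run bundle install, then start the script with bundle exec ruby main.rb.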

Step 2: Configure the API key

In this step, you’ll configure your API key to authenticate with AssemblyAI.

1

Browse to Account, and then click Copy API key under Your API key to copy it.

2

Store your API key in a variable. Replace <YOUR_API_KEY> with your copied API key.

API_KEY = "<YOUR_API_KEY>"

Step 3: Connect to the streaming service

In this step, you’ll create a WebSocket connection to the Streaming service and configure it to use your API key.

1

Configure the audio requirement variables:

SAMPLE_RATE = 16000
CHANNELS = 1
CHUNK_SIZE = 6400 # 200 ms of audio at 16 kHz = 3200 samples = 6400 bytes (16-bit = 2 bytes per sample)

$recording = false
$sox_process = nil
$stdout_thread = nil
$exit_requested = false
2

Create the websocket connection and assign your API_KEY:

ws = WebSocket::Client::Simple.connect(
  "wss://api.assemblyai.com/v2/realtime/ws?sample_rate=#{SAMPLE_RATE}",
  headers: { 'Authorization' => API_KEY }
)
Sample rate

The sample_rate parameter is the number of audio samples per second, measured in hertz (Hz). Higher sample rates produce higher-quality audio, which may lead to better transcripts, but also mean more data is sent over the network.

We recommend the following sample rates:

  • Minimum quality: 8_000 (8 kHz)
  • Medium quality: 16_000 (16 kHz)
  • Maximum quality: 48_000 (48 kHz)

If you don’t set a sample rate on the real-time transcriber, it defaults to 16 kHz.
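The chunk size used in this tutorial follows directly from the sample rate: 16-bit mono PCM carries 2 bytes per sample, so a 200 ms chunk at 16 kHz is 6,400 bytes. A quick sanity check in plain Ruby (no external dependencies):

```ruby
BYTES_PER_SAMPLE = 2 # 16-bit signed PCM
CHANNELS = 1

# Bytes needed for a chunk of `ms` milliseconds at `sample_rate` Hz.
def chunk_bytes(sample_rate, ms)
  samples = sample_rate * ms / 1000
  samples * BYTES_PER_SAMPLE * CHANNELS
end

puts chunk_bytes(16_000, 200) # 6400, the CHUNK_SIZE used in this tutorial
puts chunk_bytes(8_000, 200)  # 3200 bytes per 200 ms at minimum quality
puts chunk_bytes(48_000, 200) # 19200 bytes per 200 ms at maximum quality
```

This also shows the bandwidth trade-off: tripling the sample rate triples the bytes you stream per chunk.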

3

Assign the event handlers, which we’ll create in the next step.

ws.on :open do
  on_open(ws)
end

ws.on :message do |message|
  on_message(ws, message.data)
end

ws.on :error do |error|
  on_error(ws, error)
end

ws.on :close do |code, reason|
  on_close(ws, code, reason)
end

Step 4: Create event handlers & record audio from microphone

1

In this step, we’ll create the on_open event handler, which contains our microphone recording logic. We’ll use SoX, a cross-platform audio tool, to record audio from the microphone.

def on_open(ws)
  begin
    command = "sox -d -t raw -b 16 -c #{CHANNELS} -r #{SAMPLE_RATE} -e signed-integer -L -"
    # Open3.popen3 returns [stdin, stdout, stderr, wait_thread]; keep the wait
    # thread so we can signal and reap the SoX process later.
    _stdin, stdout, _stderr, $sox_process = Open3.popen3(command)
    puts "Started audio recording with SoX"

    $recording = true

    $stdout_thread = Thread.new do
      while $recording && ws.open?
        begin
          audio_data = stdout.read_nonblock(CHUNK_SIZE)
          if audio_data && !audio_data.empty?
            # The service expects base64-encoded audio (require 'base64').
            audio_message = {
              "audio_data" => Base64.strict_encode64(audio_data)
            }.to_json
            ws.send(audio_message)
          end
        rescue IO::WaitReadable
          sleep 0.01
        rescue EOFError
          puts "Audio stream ended"
          break
        end
      end
    end
  rescue => e
    puts "Error starting audio recording: #{e}"
    puts e.backtrace
  end
end
Audio data format

The SoX arguments configure the format of the audio output: a single channel with 16-bit signed integer PCM encoding at a 16 kHz sample rate.

If you want to stream data from elsewhere, make sure that your audio data is in the following format:

  • Single channel
  • 16-bit signed integer PCM or mu-law encoding

By default, the Streaming STT service expects PCM16-encoded audio. If you want to use mu-law encoding, see Specifying the encoding.
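To see what “single-channel, 16-bit signed little-endian PCM” means in Ruby terms, here’s a small sketch that packs a few samples into the same raw byte layout SoX produces with the flags above (Array#pack’s 's<' directive is Ruby’s 16-bit signed little-endian format):

```ruby
# Three 16-bit signed samples: silence, a positive peak, a negative peak.
samples = [0, 1000, -1000]

# 's<*' = 16-bit signed integers, little-endian; this matches the byte layout
# produced by sox's "-b 16 -e signed-integer -L" flags.
pcm16 = samples.pack('s<*')

puts pcm16.bytesize   # 6 bytes: 2 bytes per sample
p pcm16.unpack('s<*') # round-trips back to [0, 1000, -1000]
```

If your audio comes from another source, confirming that it round-trips through unpack('s<*') like this is a quick way to check the encoding before streaming it.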

2

Create functions to handle close and error events from the real-time service.

def on_error(ws, error)
  puts "Error: #{error}"
end

def on_close(ws, code, reason)
  stop_recording
  puts 'Disconnected'
end
3

Create another function to handle transcripts. The real-time transcriber returns two types of transcripts: partial and final.

  • Partial transcripts are returned as the audio is being streamed to AssemblyAI.
  • Final transcripts are returned when the service detects a pause in speech.
def on_message(ws, message)
  begin
    msg = JSON.parse(message)
    msg_type = msg['message_type']

    if msg_type == 'SessionBegins'
      session_id = msg['session_id']
      puts "Session ID: #{session_id}"
      return
    end

    text = msg['text'] || ''
    return if text.empty?

    if msg_type == 'PartialTranscript'
      puts "Partial: #{text}"
    elsif msg_type == 'FinalTranscript'
      puts "Final: #{text}"
    elsif msg_type == 'error'
      puts "Error: #{msg['error'] || 'Unknown error'}"
    end
  rescue => e
    puts "Error handling message: #{e.message}"
    puts "Raw message: #{message.inspect}"
  end
end
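Because partial transcripts are revised as more audio arrives, only the final transcripts should be stitched together if you want a running transcript of the whole session. A minimal sketch, using the same message format the handler above parses:

```ruby
require 'json'

# Collects FinalTranscript texts into one session-wide transcript,
# ignoring partials (which are superseded by later messages).
class TranscriptCollector
  def initialize
    @finals = []
  end

  def handle(raw_message)
    msg = JSON.parse(raw_message)
    @finals << msg['text'] if msg['message_type'] == 'FinalTranscript'
  end

  def full_transcript
    @finals.join(' ')
  end
end

collector = TranscriptCollector.new
collector.handle({ 'message_type' => 'PartialTranscript', 'text' => 'hello wor' }.to_json)
collector.handle({ 'message_type' => 'FinalTranscript', 'text' => 'Hello world.' }.to_json)
collector.handle({ 'message_type' => 'FinalTranscript', 'text' => 'How are you?' }.to_json)

puts collector.full_transcript # "Hello world. How are you?"
```

To use it in the tutorial code, you would call collector.handle(message) from inside on_message.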
End of utterance controls

You can configure the silence threshold for automatic utterance detection and programmatically force the end of an utterance to immediately get a Final transcript.
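These controls are sent as JSON messages over the same WebSocket. The field names below (end_utterance_silence_threshold, force_end_utterance) are taken from the Streaming API reference, so check them against the current docs before relying on them; this sketch only builds the messages:

```ruby
require 'json'

# Raise the silence threshold for automatic utterance detection to 700 ms.
threshold_message = { 'end_utterance_silence_threshold' => 700 }.to_json

# Force the end of the current utterance to get a final transcript immediately.
force_message = { 'force_end_utterance' => true }.to_json

# With an open connection, these would be sent like the audio messages, e.g.:
# ws.send(threshold_message)
# ws.send(force_message)

puts threshold_message
puts force_message
```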

Step 5: Disconnect the streaming service

1

Create the stop_recording function to handle disconnections smoothly.

def stop_recording
  if $recording
    $recording = false

    if $sox_process
      begin
        # $sox_process is the Open3 wait thread; its pid is the SoX process.
        Process.kill('TERM', $sox_process.pid) rescue nil
        $sox_process.join(2)
        $stdout_thread.join(2) if $stdout_thread
        puts "Stopped audio recording"
      rescue => e
        puts "Error closing audio recording: #{e}"
      end
    end
  end
end
2

Also, add the following code to handle the INT signal (Ctrl+C), which stops the recording and disconnects the transcriber.

Signal.trap('INT') do
  puts
  puts 'Stopping recording'
  stop_recording
  puts 'Closing real-time transcript connection'
  $exit_requested = true
end

loop do
  if $exit_requested
    ws.close if ws.open?
    break
  end
  sleep 1
end

To run the program, use the command ruby main.rb.

Next steps

To learn more about Streaming Speech-to-Text, see the following resources:

Need some help?

If you get stuck, or have any other questions, we’d love to help you out. Contact our support team at support@assemblyai.com or create a support ticket.