In this tutorial, we’ll use the open-source speech recognition toolkit Kaldi in conjunction with Python to automatically transcribe audio files. By the end of the tutorial, you’ll be able to get transcriptions in minutes with one simple command!
Important Note
For this tutorial, we are using Ubuntu 20.04.3 LTS (x86_64 ISA). If you are on Windows, the recommended procedure is to install a virtual machine and follow this tutorial exactly on a Debian-based distro (preferably the exact one mentioned above - you can find an ISO here).
Before we can get started with Kaldi for Speech Recognition, we'll need to perform some installations.
Installations
Prerequisites
The most notable prerequisites are time and disk space. The Kaldi installation can take hours and consumes almost 40 GB of space, so prepare accordingly. If you need transcriptions ASAP, check out the Cloud Speech-to-Text APIs section!
Automatic Installation
If you would like to manually install Kaldi and its dependencies, you can move on to the next subsection. If you are comfortable with an automatic installation, you can follow this subsection.
You will need wget and git installed on your machine in order to follow along. wget comes installed natively on most Linux distributions, but you may need to open a terminal and install git with your package manager.
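On Ubuntu and other Debian-based distros, that is typically:
sudo apt-get install git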
Next, navigate into the directory in which you would like to install Kaldi, and then fetch the installation script with
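Using wget, the command has this general shape (the URL below is only a placeholder for wherever setup.sh is hosted):
wget <URL-of-setup.sh>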
This command downloads the setup.sh file, which effectively just automates the manual installation below. Be sure to open this file in a text editor and inspect it to make sure you understand it and are comfortable running it. You can then perform the setup with
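sudo bash setup.sh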
Install Note
If you have multiple CPUs, you can perform a parallel build by supplying the number of processors you would like to use. For example, to use 4 CPUs, enter sudo bash setup.sh 4
Running the above command will install all of Kaldi's dependencies, and then Kaldi itself. At one point (several minutes into the installation), you will be asked to confirm that all dependencies are installed. We suggest checking and confirming, but if you are following along on a fresh Ubuntu 20.04.3 LTS install (perhaps on a virtual machine), then you can skip confirming by instead running
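For example, piping yes into the script auto-accepts the prompts (this is a generic shell approach; the setup script may also provide its own flag for this):
yes | sudo bash setup.sh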
In this case, you do not need to interact with the terminal at all during installation. The installation will likely take several hours, so feel free to step away and come back. Once the installation is complete, enter the project directory with
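Relative to the directory in which you ran setup.sh, that is:
cd kaldi/egs/kaldi-asr-tutorial/s5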
and then move on to transcribing an audio file.
Manual Installation
Before manually installing Kaldi, we’ll need to install some additional packages. First, open a terminal, and run the following commands:
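The packages below cover the build dependencies that Kaldi's dependency checker typically asks for on Ubuntu (your exact list may differ slightly):
sudo apt-get update
sudo apt-get install g++ make automake autoconf bzip2 unzip wget sox libtool git subversion python2.7 python3 zlib1g-dev gfortran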
Additional Information
- You can copy these commands and paste them into the terminal by right-clicking in the terminal and selecting “Paste”.
- We’ll also need Intel MKL, which we will install later via Kaldi if you do not have it already.
Installing Kaldi
Now we can get started installing Kaldi for Speech Recognition. First, we need to clone the Kaldi repository. In the terminal, navigate to the directory in which you’d like to clone the repository. In this case, we are cloning to the Home directory.
Run the following command:
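git clone https://github.com/kaldi-asr/kaldi.git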
Installing Tools
To begin our Kaldi installation, we’ll first need to perform the tools installation. Navigate into the tools directory with the following command:
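cd kaldi/tools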
and then install Intel MKL if you don’t already have it. This will take time - MKL is a large library.
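Kaldi provides an installer script for MKL in the tools directory; it generally needs root privileges:
sudo extras/install_mkl.sh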
Now we check to ensure all dependencies are installed. Given our preparatory installations, you should get a message telling you that all dependencies are indeed installed.
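extras/check_dependencies.sh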
If you do not have all dependencies installed, you will get an output telling you which dependencies are missing. Install any remaining packages you need, and then rerun the extras/check_dependencies.sh command. New required installations may now appear as a result of the dependencies you just installed. Continue alternating between these two steps (checking missing dependencies and installing them) until you receive a message saying that all dependencies are installed ("all OK.").
Finally, run make. See the install note below if you have a multi-CPU build.
Install Note
If you have multiple CPUs, you can do a parallel build by supplying the "-j" option to make in order to expedite the install. For example, to use 4 CPUs, enter make -j 4
Installing Src
Next, we need to perform the src install. First, cd into src:
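cd ../src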
And then run the following commands. See the install note below if you have a multi-CPU build. This build may take several hours for uniprocessor systems.
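The usual Kaldi src build sequence looks like this (the --shared flag for configure is the commonly recommended default):
./configure --shared
make depend
make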
Install Note
Again, you can supply the -j option to both make depend and make if you have multiple CPUs in order to expedite the install. For example, to use 4 CPUs, enter make depend -j 4 and make -j 4
Cloning the Project Repository
Now it’s time to clone the project repository provided by AssemblyAI, which hosts the code required for the remainder of the tutorial. The project repository follows the structure of the other folders in kaldi/egs (the “examples” directory in Kaldi root) and includes additional files to automate the transcription generation for you.
Navigate into the egs folder, clone the project repository, and then navigate into the s5 subdirectory:
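Assuming the project repository is the kaldi-asr-tutorial repo hosted under the AssemblyAI organization on GitHub (substitute the actual repository URL if it differs), the commands look like this:
cd ~/kaldi/egs  # assuming you cloned Kaldi into your home directory
git clone https://github.com/AssemblyAI/kaldi-asr-tutorial.git  # substitute the actual repository URL if it differs
cd kaldi-asr-tutorial/s5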
Additional Information
At this point you can delete all other folders in the egs directory. They take up about 10 GB of disk space, but consist of other examples that you may want to check out after this tutorial.
Transcribing an Audio File - Quick Usage
Now we’re ready to get started transcribing an audio file! We’ve provided everything you need to automatically transcribe a .wav file in a single line of code.
For a minimal example, all you need to do is run
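python3 main.py gettysburg.wav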
This command will transcribe the provided example audio file gettysburg.wav - a 10 second .wav file containing the first line of the Gettysburg Address. The command will take several minutes to execute, after which you will find the transcription in kaldi-asr-tutorial/s5/out.txt.
Important Note
You will need an internet connection the first time you run main.py in order to download the pre-trained models.
If you would like to transcribe your own .wav file, first place it in the s5 subdirectory, and then run:
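python3 main.py gettysburg.wav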
Where you replace gettysburg.wav with the name of your file. If the only .wav file in the s5 subdirectory is your target audio file, you can simply run python3 main.py without specifying the filename.
This automated process will work best with a single speaker and a relatively short audio. For more complicated usage, you’ll have to read the next section and modify the code to suit your needs, following along with the Kaldi documentation.
Resetting the Directory
Each time you run main.py, it will call reset_directory.py, which removes all files/folders generated by main.py (except the downloaded tarballs of the pre-trained models) in order to start each run with a clean slate. This means that your out.txt transcription will be deleted if you call main.py on another file, so if you would like to keep it, be sure to move out.txt to another directory before transcribing another file.
If you interrupt the main.py execution while the pre-trained models are downloading, you will receive errors downstream. In this case, run the following command to completely reset the directory (i.e. remove the pre-trained model tarballs in addition to the files/folders removed by reset_directory.py):
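python3 reset_directory_completely.py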
Transcribing an Audio File - Understanding the Code
If you’re interested in understanding how Kaldi's Speech Recognition generated the transcription in the previous section, then read on!
We’re going to dive into main.py in order to understand the entire process of generating a transcription with Kaldi. Keep in mind that our use case is a toy example to showcase how to use pre-trained Kaldi models for ASR. Kaldi is a very powerful toolkit which accommodates much more complicated usage; but it does have a sizable learning curve, so learning how to properly apply it to more complicated tasks will take some time.
Also, we’ll give brief overviews of the theory behind what’s going on in different sections, but ASR is a complicated topic, so by nature our conversation will be surface level!
Let’s get started.
Imports
We kick things off with some imports. First, we import reset_directory; importing it runs the script, which clears the directory of files/folders generated by the rest of main.py so we can start with a clean slate. Then we import subprocess so we can issue bash commands, as well as some other packages which we’ll use for os navigation and file manipulation.
import reset_directory
import subprocess as s
import os
import sys
import glob
Argument Validation
Next, we perform some argument validation. We ensure that at most one additional argument is passed in to main.py; and, if there is one, we ensure that it is a .wav file. If no argument is given, then we simply choose the first .wav file found by glob.glob, if such a file exists.
We save the filename (with and without extension) in variables for later use.
if len(sys.argv) == 1:
    try:
        FILE_NAME_WAV = glob.glob("*.wav")[0]
    except:
        raise ValueError("No .wav file in the root directory")
elif len(sys.argv) == 2:
    FILE_NAME_WAV = list(sys.argv)[1]
    if FILE_NAME_WAV[-4:] != ".wav":
        raise ValueError("Provided filename does not end in '.wav'")
else:
    raise ValueError("Too many arguments provided. Aborting")
FILE_NAME = FILE_NAME_WAV[:-4]
Kaldi File Generation
Now it’s time to create some standard files that Kaldi requires to generate transcriptions. We save the s5 directory path into a variable so that we can easily navigate back to it, and then create and navigate into a data/test directory that we will store our data in.
ORIGINAL_DIRECTORY = os.getcwd()
# Make data/test dir
os.makedirs("./data/test")
os.chdir("./data/test")
The first file we’ll generate is called spk2utt, which maps speakers to their utterances. For our purposes, we assume that there is one speaker and one utterance, so the file is easy to generate automatically.
with open("spk2utt", "w") as f:
    f.write("global {0}".format(FILE_NAME))
Next, we create the inverse mapping in the utt2spk file. Note that this file is one-to-one, unlike the one-to-many nature of spk2utt (one speaker may have multiple utterances, but each utterance can have only one speaker). For our purposes it is once again easy to generate this file:
with open("utt2spk", "w") as f:
    f.write("{0} global".format(FILE_NAME))
The last file we create is called wav.scp. It maps audio file identifiers to their system paths. We again generate this file automatically.
wav_path = os.getcwd() + "/" + FILE_NAME_WAV
with open("wav.scp", "w") as f:
    f.write("{0} {1}".format(FILE_NAME, wav_path))
Finally, we return to the root directory
os.chdir(ORIGINAL_DIRECTORY)
Additional Information
Note that these are not the only possible input files that Kaldi can use, just the bare minimum. For more advanced usage, such as gender mapping, check out the Kaldi documentation.
MFCC Configuration File Modification
To perform ASR with Kaldi on our audio file, we must first determine some method of representing this data in a format that a Kaldi model can handle. For this, we use Mel-frequency cepstral coefficients (MFCCs). MFCCs are a set of coefficients that define the mel-frequency cepstrum of the audio, which itself is a cosine transform of the logarithmic power spectrum of a nonlinear mapping (mel-frequency) of the Fourier transform of the signal. If that sounds confusing, don’t worry - it’s not necessary to understand for the purposes of generating transcriptions! The important thing to know is that MFCCs are a low dimensional representation of an audio signal that are inspired by human auditory processing.
There is a configuration file that we use when we are generating MFCCs, located in ./conf/mfcc_hires.conf. The only thing we need to know from a practical standpoint is that we must modify this file to list the proper sample rate for our input .wav file. We do this automatically as follows:
First, we call a subprocess which opens a bash shell and uses sox to get the audio information of the .wav file. Then, we perform string manipulation to isolate the sample rate of the .wav file.
bash_out = s.run("soxi {0}".format(FILE_NAME_WAV), stdout=s.PIPE, text=True, shell=True)
cleaned_list = bash_out.stdout.replace(" ","").split('\n')
sample_rate = [x for x in cleaned_list if x.startswith('SampleRate:')]
sample_rate = sample_rate[0].split(":")[1]
Next, we open and read the MFCC configuration file so that we can modify it
with open("./conf/mfcc_hires.conf", "r") as mfcc:
    lines = mfcc.readlines()
Then we identify and isolate the line that sets the sample frequency.
line_idx = [lines.index(l) for l in lines if l.startswith('--sample-frequency=')]
line = lines[line_idx[0]]
Next, we reformat this line to list the sample rate of our .wav file as identified by the soxi command.
line = line.split("=")
line[1] = sample_rate + line[1][line[1].index(" #"):]
line = "=".join(line)
Finally, we replace the relevant line in the lines list, collapse this list back into a string, and then write this string to the MFCC configuration file.
lines[line_idx[0]] = line
final_str = "".join(lines)
with open("./conf/mfcc_hires.conf", "w") as mfcc:
    mfcc.write(final_str)
Feature Extraction
Now we can get started processing our audio file. First, we open a file for logging our bash outputs, which we will use for every bash command going forward. Then, we copy our .wav file into the ./data/test directory, and then copy the whole ./data/test directory into a new directory (./data/test_hires) for processing.
with open("main_log.txt", "w") as f:
    # All of the bash commands from here on run inside this with-block, so their output is logged to main_log.txt
    bash_out = s.run("cp {0} data/test/{0}".format(FILE_NAME_WAV), stdout=f, text=True, shell=True)
    bash_out = s.run("utils/copy_data_dir.sh data/test data/test_hires", stdout=f, text=True, shell=True)
Next, we generate MFCC features using our data and the configuration file we previously modified.
bash_out = s.run("steps/make_mfcc.sh --nj 1 --mfcc-config "
"conf/mfcc_hires.conf data/test_hires", stdout=f, text=True, shell=True)
Additional Information
More information about the arguments of the bash command can be found here:
- steps/make_mfcc.sh: specifies the location of the shell script which generates MFCCs
- --nj 1: specifies the number of jobs to run with. If you have a multi-core machine, you can increase this number
- --mfcc-config conf/mfcc_hires.conf: specifies the location of the configuration file we previously modified
- data/test_hires: specifies the data folder containing the relevant data we will operate on
This command generates the conf, data, and log directories, as well as the feats.scp, frame_shift, utt2dur, and utt2num_frames files (all within the data/test_hires directory).
After this, we compute the cepstral mean and variance normalization (CMVN) statistics on the data, which minimizes the distortion caused by noise contamination. That is, CMVN helps make our ASR system more robust against noise.
bash_out = s.run("steps/compute_cmvn_stats.sh data/test_hires", stdout=f, text=True, shell=True)
Finally, we use the fix_data_dir.sh shell script to ensure that the files within the data directory are properly sorted and filtered, and also to create a data backup in data/test_hires/.backup.
bash_out = s.run("utils/fix_data_dir.sh data/test_hires", stdout=f, text=True, shell=True)
Pre-trained Model Download and Extraction
Now that we have performed MFCC feature extraction and CMVN normalization, we need a model to pass the data through. In this case we will be using the Librispeech ASR Model, found in Kaldi’s pre-trained model library, which was trained on the LibriSpeech dataset. This model is composed of four submodels:
- An i-vector extractor
- A TDNN-F based chain model
- A small trigram language model
- An LSTM-based model for rescoring
To download these models, we first check to see if these tarballs are already in our directory. If they are not, we download them using wget
for component in ["chain", "extractor", "lm"]:
    tarball = "0013_librispeech_v1_{0}.tar.gz".format(component)
    if tarball not in os.listdir():
        bash_out = s.run('wget http://kaldi-asr.org/models/13/{0}'.format(tarball), stdout=f, text=True, shell=True)
and extract them using tar.
bash_out = s.run('for f in *.tar.gz; do tar -xvzf "$f"; done', stdout=f, text=True, shell=True)
This creates the exp/nnet3_cleaned, exp/chain_cleaned, data/lang_test_tgsmall, and exp/rnnlm_lstm_1a directories.
- nnet3_cleaned is the i-vector extractor directory
- chain_cleaned is the chain model directory
- tgsmall is the small trigram language model directory
- and rnnlm is the LSTM-based rescoring model
Warning
If the wget process is interrupted during download, you will run into errors downstream. In this case, run the below in terminal to delete any model tarballs that are there and completely reset the directory. We call reset_directory.py rather than reset_directory_completely.py by default so we don't have to download the models (~430 MB compressed) each time we run main.py.
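python3 reset_directory_completely.py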
Decoding Generation
Extracting i-vectors
Next up, we’ll extract i-vectors, which are used to identify different speakers. Even though we have only one speaker in this case, we extract i-vectors anyway for the general use case, and because they are expected downstream.
We create a directory to store the i-vectors and then run a bash command to extract them:
os.makedirs("./exp/nnet3_cleaned/ivectors_test_hires")
bash_out = s.run("steps/online/nnet2/extract_ivectors_online.sh --nj 1 "
"data/test_hires exp/nnet3_cleaned/extractor exp/nnet3_cleaned/ivectors_test_hires",
stdout=f, text=True, shell=True)
Additional Information
More information about the arguments of the bash command can be found here:
- steps/online/nnet2/extract_ivectors_online.sh: specifies the location of the shell script which extracts the i-vectors
- --nj 1: specifies the number of jobs to run with. If you have a multi-core machine, you can increase this number
- data/test_hires: specifies the location of the data directory
- exp/nnet3_cleaned/extractor: specifies the location of the extractor directory
- exp/nnet3_cleaned/ivectors_test_hires: specifies the location to store the i-vectors
Constructing the Decoding Graph
In order to get our transcription, we need to pass our data through the decoding graph. In our case, we will construct a fully-expanded decoding graph (HCLG) that represents the language model, lexicon (pronunciation dictionary), context-dependency, and HMM structure in the model.
Additional Information
The output of the decoding graph is a Finite State Transducer that has word-ids on the output, and transition-ids on the input (the indices that resolve to pdf-ids)
HCLG stands for a composition of functions, where
- H contains HMM definitions, whose inputs are transition-ids and outputs are context-dependent phones
- C is the context-dependency, that takes in context-dependent phones and outputs phones
- L is the lexicon, which takes in phones and outputs words
- and G is an acceptor that encodes the grammar or language model, which both takes in and outputs words
The end result is our decoding, in this case a transcription of our single utterance.
Before we can pass our data through the decoding graph, we need to construct it. We create a directory to store the graph, and then construct it with the following command.
os.makedirs("./exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall")
bash_out = s.run("utils/mkgraph.sh --self-loop-scale 1.0 --remove-oov "
"data/lang_test_tgsmall exp/chain_cleaned/tdnn_1d_sp exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall",
stdout=f, text=True, shell=True)
Additional Information
More information about the arguments of the bash command can be found here:
- utils/mkgraph.sh: specifies the location of the shell script which constructs the decoding graph
- --self-loop-scale 1.0: scales self-loops by the specified value relative to the language model [1]
- --remove-oov: remove out-of-vocabulary (oov) words
- data/lang_test_tgsmall: specifies the location of the language directory
- exp/chain_cleaned/tdnn_1d_sp: specifies the location of the model directory
- exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall: specifies the location to store the constructed graph
Decoding using the Generated Graph
Now that we have constructed our decoding graph, we can finally use it to generate our transcription!
First we create a directory to store the decoding information, and then decode using the following command.
os.makedirs("./exp/chain_cleaned/tdnn_1d_sp/decode_test_tgsmall")
bash_out = s.run("steps/nnet3/decode.sh --acwt 1.0 --post-decode-acwt 10.0 --nj 1 "
"--online-ivector-dir exp/nnet3_cleaned/ivectors_test_hires "
"exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall "
"data/test_hires exp/chain_cleaned/tdnn_1d_sp/decode_test_tgsmall",
stdout=f, text=True, shell=True)
Additional Information
More information about the arguments of the bash command can be found here:
- steps/nnet3/decode.sh: specifies the location of the shell script which runs the decoding
- --acwt 1.0: sets the acoustic scale. The default is 0.1, but this is not suitable for chain models [2]
- --post-decode-acwt 10.0: scales the acoustics by 10 so that the regular scoring script works (necessary for chain models)
- --nj 1: specifies the number of jobs to run with. If you have a multi-core machine, you can increase this number
- --online-ivector-dir exp/nnet3_cleaned/ivectors_test_hires: specifies the i-vector directory
- exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall: specifies the location of the graph directory
- data/test_hires: specifies the location of the data directory
- exp/chain_cleaned/tdnn_1d_sp/decode_test_tgsmall: specifies the location to store the decoding information
Transcription Retrieval
It’s time to retrieve our transcription! The transcription lattice is stored as a GNU zip file in the decode_test_tgsmall directory, among other files (including word-error rates if you have input a Kaldi text file).
We store the paths of our zip file and the graph's words.txt file, and then pass these into a command variable which stores our bash command. This command unzips our zip file, and then writes the optimal path through the lattice (the transcription) to a file called out.txt in our s5 directory.
gz_location = "exp/chain_cleaned/tdnn_1d_sp/decode_test_tgsmall/lat.1.gz"
words_txt_loc = "{0}/exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall/words.txt".format(ORIGINAL_DIRECTORY)
command = "../../../src/latbin/lattice-best-path " \
"ark:'gunzip -c {0} |' " \
"'ark,t:| utils/int2sym.pl -f 2- " \
"{1} > out.txt'".format(gz_location, words_txt_loc)
bash_out = s.run(command, stdout=f, text=True, shell=True)
Additional Information
More information about the arguments of the bash command can be found here:
- ../../../src/latbin/lattice-best-path: specifies the location of the compiled Kaldi binary that finds the best path through each lattice
- ark:'gunzip -c {0} |': reads the compressed lattice archive, decompressing it on the fly [3][4]
- 'ark,t:| utils/int2sym.pl -f 2- {1} > out.txt': writes the best path as a text-form archive and pipes it through int2sym.pl, which converts the integer word IDs (fields 2 onward) back to words using words.txt and saves the result to out.txt
Let’s take a look at how our generated transcription compares to the true transcription!
Real:
FOUR SCORE AND SEVEN YEARS AGO OUR FATHERS BROUGHT FORTH ON THIS CONTINENT A NEW NATION CONCEIVED IN LIBERTY AND DEDICATED TO THE PROPOSITION THAT ALL MEN ARE CREATED EQUAL
Transcription:
FOUR SCORE AN SEVEN YEARS AGO OUR FATHERS BROUGHT FORTH UND IS CONTINENT A NEW NATION CONCEIVED A LIBERTY A DEDICATED TO THE PROPOSITION THAT ALL MEN ARE CREATED EQUAL
Out of 30 words we had 5 errors, yielding a word error rate of about 17%.
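If you would like to compute this number yourself, here is a small standalone sketch (not part of main.py) that calculates the word error rate as the word-level edit distance divided by the number of reference words:
def wer(reference, hypothesis):
    # Word error rate = word-level edit distance / number of reference words
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)
For the two transcripts above, wer(real, transcription) returns 5/30 ≈ 0.167, or about 17%.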
Rescoring with LSTM-based Model
We can rescore with the LSTM-based model using the below command:
command = "../../../scripts/rnnlm/lmrescore_pruned.sh --weight 0.45 --max-ngram-order 4 " \
"data/lang_test_tgsmall exp/rnnlm_lstm_1a data/test_hires " \
"exp/chain_cleaned/tdnn_1d_sp/decode_test_tgsmall exp/chain_cleaned/tdnn_1d_sp/decode_test_rescore"
bash_out = s.run(command, stdout=f, text=True, shell=True)
Additional Information
More information about the arguments of the bash command can be found here:
- ../../../scripts/rnnlm/lmrescore_pruned.sh: specifies the location of the shell script which runs the rescoring [5]
- --weight 0.45: specifies the interpolation weight for the RNNLM
- --max-ngram-order 4: approximates the lattice-rescoring by merging histories in the lattice if they share the same ngram history, which prevents the lattice from exploding exponentially
- data/lang_test_tgsmall: specifies the old language model directory
- exp/rnnlm_lstm_1a: specifies the RNN language model directory
- data/test_hires: specifies the data directory
- exp/chain_cleaned/tdnn_1d_sp/decode_test_tgsmall: specifies the input decoding directory
- exp/chain_cleaned/tdnn_1d_sp/decode_test_rescore: specifies the output decoding directory
We again output the transcription to a .txt file, in this case called out_rescore.txt:
command = "../../../src/latbin/lattice-best-path " \
"ark:'gunzip -c exp/chain_cleaned/tdnn_1d_sp/decode_test_rescore/lat.1.gz |' " \
"'ark,t:| utils/int2sym.pl -f 2- " \
"{0}/exp/chain_cleaned/tdnn_1d_sp/graph_tgsmall/words.txt > out_rescore.txt'".format(ORIGINAL_DIRECTORY)
bash_out = s.run(command, stdout=f, text=True, shell=True)
In our case, rescoring did not change our generated transcription, but it may improve yours!
Advanced Kaldi Speech Recognition
Hopefully this tutorial gave you an understanding of the Kaldi basics and a jumping off point for more complicated NLP tasks! We just used a single utterance and a single .wav file, but we might also consider cases where we want to do speaker identification, audio alignment, or more.
You can also go beyond using pre-trained models with Kaldi. For example, if you have data to train your own model, you could make your own end-to-end system, or integrate a custom acoustic model into a system that uses a pre-trained language model. Whatever your goals, you can use the building blocks identified in this article to help you get started!
There are a ton of different ways to process audio to extract useful information, and each way offers its own subfield rich with task-specific knowledge and a history of creative approaches. If you want to dive deeper into Kaldi to build your own complicated NLP systems, you can check out the Kaldi documentation here.
Cloud Speech-to-Text APIs
Kaldi is a very powerful and well-maintained framework for NLP applications, but it’s not designed for the casual user. It can take a long time to understand how Kaldi operates under the hood, an understanding that is necessary to put it to proper use.
Kaldi is consequently not designed for plug-and-play speech processing applications. This can pose difficulties for those who don’t have the time or know-how to customize and train NLP models, but who want to implement speech recognition in larger applications.
If you want to get high quality transcripts in just a few lines of code, AssemblyAI offers a fast, accurate, and easy-to-use Speech-to-Text API. You can sign up for a free API token here and gain access to state-of-the-art models that provide:
- Core Transcription
- Asynchronous Speech-to-Text
- Real-Time Speech-to-Text
- Audio Intelligence
- Summarization
- Emotion Detection
- Sentiment Analysis
- Topic Detection
- Content Moderation
- Entity Detection
- PII Redaction
- And much more!
Grab a token and check out the AssemblyAI docs to get started.
Footnotes
1) Link to "Scaling of transition and acoustic probabilities" in the Kaldi documentation
2) Link to "Decoding with 'chain' models" in the Kaldi documentation
3) Link to "Extended filenames: rxfilenames and wxfilenames" in the Kaldi documentation
4) Link to "Table I/O" in the Kaldi documentation
5) Link to the lmrescore_pruned.sh script in the Kaldi ASR GitHub repo
6) For other beginner resources on getting started with Kaldi, check out this, this, or this resource. Elements from these sources have been adapted for use within this article.