
How Well Does AI Transcribe Song Lyrics?

Since the industry-wide adoption of deep learning technology, the ability of AI to recognize speech has dramatically increased. Today, AI transcription technology is on par with human transcription in accuracy. At least, that’s the case with human speech. We were curious how well AI can recognize lyrics in songs.

Singing and speaking are very different in terms of pitch, duration, and intensity. Singing has higher intensity, more variation in pitch, and uses much more air with each phrase. Singing also has much less articulation and much more emphasis on vowels. In regular speech, the vowel-to-consonant ratio is about five to one; in singing, it can be as high as 200 to one. That’s 40 times as much vowel time as a speech-trained AI expects, on top of a larger range of intensity and pitch. Based on these differences, we didn’t expect Automatic Speech Recognition models to be able to recognize more than 5%, or 1/20th, of a song. In our findings below, this turns out to be mostly true, but there were some scenarios where our Automatic Speech Recognition models were able to transcribe a song fairly well!

How to Test AI Transcription on Song Lyrics

We transcribed 15 songs from 3 different genres, by 15 different artists, with AssemblyAI’s Speech-to-Text API and calculated their word error rates (WERs) with the Python jiwer library to get an idea of how well AI transcribes music. Since we only expect the AI to get 1/20th, or 5%, of each song correct, we expect WER scores above 0.95.
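For reference, here’s a minimal sketch of how a score like that can be computed with jiwer. The file names are hypothetical, and the light normalization (lowercasing, stripping punctuation) is our assumption about a sensible way to compare lyrics against a transcript, not necessarily the exact preprocessing behind the numbers below.

```python
import string

from jiwer import wer


def normalize(text: str) -> str:
    # Lowercase and strip punctuation so formatting differences between
    # the official lyrics and the AI transcript don't count as errors.
    return text.lower().translate(str.maketrans("", "", string.punctuation))


# Hypothetical file names for the official lyrics and the AI transcript.
with open("wildest_dreams_lyrics.txt") as f:
    reference = normalize(f.read())
with open("wildest_dreams_transcript.txt") as f:
    hypothesis = normalize(f.read())

print(f"WER: {wer(reference, hypothesis):.3f}")  # 0.0 is a perfect transcript
```

Keep in mind that WER counts substitutions, insertions, and deletions against the reference, so a sufficiently bad transcript can even score above 1.0.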

How did we go about analyzing these songs? We used the Python library youtube_dl to download videos from YouTube and extract the audio into an mp3 file. Then we used Python Click and the AssemblyAI Speech-to-Text API to upload and transcribe the audio into text files. We looked up the official lyrics to our chosen songs on Google and copied them into text files. Finally, we used the jiwer Python library to calculate the word error rate for each song and compared the results by genre. You can find the source code here. The three genres we’ll evaluate below are pop, rock, and RnB.
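To make the pipeline concrete (minus the Click command-line wrapper), here’s a hedged sketch of the download-and-transcribe steps. The video URL, file names, and API key are placeholders; the endpoints are AssemblyAI’s public v2 upload and transcript REST endpoints, though the exact options in the linked source code may differ.

```python
import time

import requests
import youtube_dl

# 1. Download a song from YouTube and extract the audio as an mp3.
ydl_opts = {
    "format": "bestaudio/best",
    "outtmpl": "song.%(ext)s",  # hypothetical output file name
    "postprocessors": [{
        "key": "FFmpegExtractAudio",
        "preferredcodec": "mp3",
    }],
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=VIDEO_ID"])  # placeholder URL

headers = {"authorization": "YOUR_ASSEMBLYAI_API_KEY"}  # placeholder key

# 2. Upload the mp3 to AssemblyAI.
with open("song.mp3", "rb") as f:
    upload_url = requests.post(
        "https://api.assemblyai.com/v2/upload", headers=headers, data=f
    ).json()["upload_url"]

# 3. Request a transcript and poll until the job finishes.
transcript_id = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers=headers,
    json={"audio_url": upload_url},
).json()["id"]

while True:
    result = requests.get(
        f"https://api.assemblyai.com/v2/transcript/{transcript_id}",
        headers=headers,
    ).json()
    if result["status"] in ("completed", "error"):
        break
    time.sleep(5)

# 4. Save the transcript text for jiwer to score later.
with open("song_transcript.txt", "w") as f:
    f.write(result.get("text") or "")
```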

How Well Does AI Recognize Pop Music?

We’re going to download five well-known pop songs from YouTube, use AssemblyAI’s AI transcription API to transcribe them, and then calculate their word error rates against the official lyrics. The five pop songs we’ll analyze are "Thriller" by Michael Jackson, "Watermelon Sugar" by Harry Styles, "Good 4 U" by Olivia Rodrigo, "Levitating" by Dua Lipa, and "Wildest Dreams" by Taylor Swift.

Surprisingly, the most recognized song was "Wildest Dreams" by Taylor Swift.

And, drumroll please, the least recognized song was “Thriller” by Michael Jackson.

If you listen to the songs, this makes some sense. "Thriller"’s vocals are faster, and Michael Jackson’s voice covers more range. The order of word error rate from lowest to highest for the selected pop songs is "Wildest Dreams", "Good 4 U", "Levitating", "Watermelon Sugar", and "Thriller". Overall, female voices were recognized better for this set of songs; maybe that’s true throughout the other genres too. Now to the actual numbers!

Here are the numbers, in order from lowest to highest word error rate for pop music:

- "Wildest Dreams" by Taylor Swift: 0.593
- "Good 4 U" by Olivia Rodrigo: 0.660
- "Levitating" by Dua Lipa: 0.745
- "Watermelon Sugar" by Harry Styles: 0.807
- "Thriller" by Michael Jackson: 0.878

These numbers show that we managed to transcribe almost half of Taylor Swift’s "Wildest Dreams", which far exceeded our expectations. Let’s take a look at what the transcript actually looks like. The AI transcribes the beginning of the song as:

“He said let's get out of this town Drive out other cities away from the crowd”

Pretty good, considering the actual words are “he said let’s get out of this town, drive out of the city, away from the crowd.” The AI only turned “of the city” into “other cities”, which is reasonable given the music in the background, and these phrases sound pretty similar when you say them out loud! What about “Thriller”? How does "Thriller" start?

The first line in the song goes “It’s close to miiiiiidnight, and something evil’s lurking in the dark.” What does the AI predict for the first line?

“It's close to me inside than people are looking in the door to the moon.” 

Why does the AI struggle to recognize when Michael Jackson sings “midnight”? Could it be because he sings it “miiiiiidnight”, with most of the emphasis on the first syllable? Earlier we noted that the vowel-to-consonant ratio for singing can be almost 40 times the ratio for speech; this looks like an example of that distortion in action.

How Well Does AI Recognize Rock Music?

We’ll get the rock song transcripts the same way we got the pop songs: download them from YouTube and transcribe them with AssemblyAI’s Speech-to-Text API. We’ll analyze the following five iconic rock songs: "I Love Rock N Roll" by Joan Jett, "American Idiot" by Green Day, "Born To Be Wild" by Steppenwolf, "You Shook Me All Night Long" by AC/DC, and "Don't Stop Believing" by Journey.

The lowest WER score among these rock songs went to "Don't Stop Believing" by Journey at 0.735. Surprisingly, the lowest WER score here is about equal to the average (0.737) of the pop songs picked above.

From second lowest to highest, the remaining songs and WER scores are:

- "You Shook Me All Night Long" by AC/DC: 0.802
- "American Idiot" by Green Day: 0.807
- "Born To Be Wild" by Steppenwolf: 0.820
- "I Love Rock N Roll" by Joan Jett and the Blackhearts: 0.834

This works out to an average word error rate of 0.80 across these 5 rock songs. From this analysis, it appears that rock is a harder genre than pop for current Automatic Speech Recognition software to recognize.
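As a quick sanity check on that average (and the pop average quoted above), the per-genre comparison is just a few lines of arithmetic; the dictionaries below simply hard-code the WER scores reported in this post:

```python
# WER scores as reported in this post, keyed by song title.
pop = {"Wildest Dreams": 0.593, "Good 4 U": 0.660, "Levitating": 0.745,
       "Watermelon Sugar": 0.807, "Thriller": 0.878}
rock = {"Don't Stop Believing": 0.735, "You Shook Me All Night Long": 0.802,
        "American Idiot": 0.807, "Born To Be Wild": 0.820,
        "I Love Rock N Roll": 0.834}

for genre, scores in {"pop": pop, "rock": rock}.items():
    print(f"{genre}: average WER = {sum(scores.values()) / len(scores):.3f}")
# pop: average WER = 0.737
# rock: average WER = 0.800
```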

Time for a deeper dive. Let’s take a look at some of the transcriptions for these songs to see what’s going on behind the scenes. I’m only going to briefly touch on the transcript for "Don't Stop Believing".

Just a small town … now where I don't stop Believin hold on to the feel …

Funnily enough, the AI was able to transcribe the iconic "Don't Stop Believing" lyric at least once. As humans, we can easily understand “Don't Stop Believing”, but let’s look at a more interesting case, one that's harder for most English speakers to understand.

Next, let’s look at the song “You Shook Me All Night Long”. First of all, we should acknowledge that most humans would struggle to transcribe the lyrics of this song. That being said, it’s surprising that the AI transcribed this song better than most of the others in this genre, because those songs are much more intelligible. The AI predicts the first line of this song as:

“So I she was out when she can create one kick off”

Not the intro to “You Shook Me All Night Long”

Clearly these are not the words, but can you really fault the machine? Take a listen and you tell me what the real words are.

Finally, let’s take a look at one more song. “I Love Rock N Roll” was difficult to transcribe, but we have to give the AI credit for getting this part of the song right:

… it would be long it was with me? Yeah, me and I could tell it wouldn't be long it was with me? Yeah. Me singing I love …

This is one of the most iconic lines of the song, but somehow the transcript manages to not pick up the words "I Love Rock N Roll" a SINGLE time. Perhaps that’s because the line isn’t much louder than the claps and music in the background, or perhaps because it’s sung as repetitive background vocals, which makes it hard for the AI to understand.

How Well Does AI Recognize RnB Music?

A couple of the RnB songs we picked are throwbacks: "One Two Step" by Ciara, "Adorn" by Miguel, and "Umbrella" by Rihanna. The other two (more modern) songs rounding out the RnB selection are "Hotline Bling" by Drake and "Kiss Me More" by Doja Cat.

Alright, let’s see how Speech Recognition technology does on RnB music.

Amazingly, the lowest WER score goes to "Hotline Bling" by Drake at 0.473. This is significantly better than anything else we’ve analyzed so far; it means the AI managed to transcribe more than half of this song correctly. Unfortunately, that’s where the good news for AI Speech Recognition on RnB music ends. The next lowest WER is 0.734 for "Umbrella" by Rihanna. What is it about Drake that makes him so easily understandable to automatic Speech Recognition technology? Could it be because he doesn’t exactly sing?

The next three songs from lowest to highest word error rate are:

- "One Two Step" by Ciara: 0.758
- "Kiss Me More" by Doja Cat feat. SZA: 0.793
- "Adorn" by Miguel (which objectively should have been a more recognizable song): 0.819

The average WER score among these five RnB songs is 0.715, which is the lowest so far, but still not promising.

“Hotline Bling” by Drake is actually transcribed remarkably well by automatic Speech Recognition. Very impressively, the model gets the first couple of bars almost entirely correct:

You used to call me on my you used to you used to yeah, you used to call me on my cellphone Late night when you need my love Call me on my cellphone Late night when you need my love And I know when the line … 

It literally only messes up “I know when that hotline bling” into “I know when the line please”, which is pretty incredible. First of all, I’m sure no one ever said “I know when that hotline bling” in real life prior to this song, and if you say it today, it’s probably ironic. If you listen to the rest of the song, you’ll probably agree with our analysis that part of what makes Drake so interpretable to AI Speech Recognition technology is that he’s mostly talking, not singing.

Special Feature: "Rap God" by Eminem

Alright, we all know Rap God by Eminem. Maybe you even, embarrassingly, tried to rap the fast part of this song at some point. It’s okay, we’ve all been there. We already know this song is hard for humans to transcribe, but let’s see how the AI does.

Given the speed of the vocals in this song, it should be difficult for Automatic Speech Recognition models to transcribe. That being said, the actual WER was 0.654! That’s the second-lowest WER of any song we tested. Once again, it appears rap music is easier to recognize because it’s the closest to regular speech.

Conclusion

Our initial hypothesis that our AI transcription would get WER scores higher than 0.95 turned out to be false: not a single song was transcribed so poorly that only 5% of the words came out right. It turns out that state-of-the-art AI Speech Recognition technology like AssemblyAI’s Speech-to-Text API can recognize about 20 to 30% of the lyrics in a song. If we were to strip out the background noise/music and isolate just the vocals, we’d expect the lyrics to be transcribed much more accurately. Another option to improve AI recognition of song lyrics would be to fine-tune a Speech Recognition model on songs, with singing and background noise making up a large portion of the training data.
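We didn’t test vocal isolation here, but as one example, the open-source Spleeter library can split a track into vocal and accompaniment stems; the vocals could then be run back through the same transcription pipeline. A minimal sketch, assuming Spleeter is installed and reusing the hypothetical "song.mp3" from earlier:

```python
# Vocal isolation with Spleeter (our choice of tool for illustration).
# Requires: pip install spleeter
from spleeter.separator import Separator

# "spleeter:2stems" splits audio into two stems: vocals and accompaniment.
separator = Separator("spleeter:2stems")

# Writes output/song/vocals.wav and output/song/accompaniment.wav;
# "song.mp3" is the file downloaded earlier in the pipeline.
separator.separate_to_file("song.mp3", "output/")
```

Transcribing the vocals.wav stem should remove much of the background-music confusion, though the singing-specific distortions, like stretched vowels and wide pitch range, would remain.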