This week’s Deep Learning Research paper is “Deep Shallow Fusion for RNN-T Personalization.”
What’s Exciting About this Paper
End-to-end deep learning ASR models can produce highly accurate transcriptions, but they are much harder to personalize. Because they are trained end to end, they lack the composability of traditional pipelines with separate acoustic, language, and pronunciation models. That lack of composability makes personalization challenging: it is harder to accurately predict custom vocabularies and rare proper nouns. This paper walks through methods that improve the accuracy of end-to-end deep learning models on proper nouns and rare words.
Key Findings
The paper covers four different techniques for improving proper-noun recognition. Two in particular stood out to us: they are simpler than the others but still produce solid accuracy gains. With this suite of training tricks, you can improve the model's ability to predict proper nouns and rare words.
- Subword regularization: During training, instead of feeding the single most probable prediction from the previous timestep into the current timestep, you can sample from a list of n-best outputs and use that sample as input. This keeps the model from overfitting to high-frequency words and should improve predictions for rarer words (see the sketch after this list).
- Grapheme-2-Grapheme: You can use a G2G model to augment your dataset! G2G models can transform a word into alternative spellings with similar pronunciations, such as “Kaity” → “Katie.” Using G2G to generate these alternative spellings for decoding led to a significant improvement in rare-name recognition (a sketch follows below).
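To make the sampling idea concrete, here is a minimal PyTorch sketch of drawing the decoder's feedback token from the n-best candidates instead of taking the argmax. This is our own illustration, not code from the paper: the function name `sample_from_topk` and the choice of top-k sampling are our assumptions.

```python
import torch
import torch.nn.functional as F

def sample_from_topk(logits: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Pick the token fed back into the decoder at the next timestep by
    sampling from the k most probable tokens instead of taking the argmax.

    logits: (batch, vocab_size) unnormalized scores at the current step.
    Returns: (batch,) sampled token ids.
    """
    topk_logits, topk_ids = torch.topk(logits, k, dim=-1)  # n-best candidates
    probs = F.softmax(topk_logits, dim=-1)                 # renormalize over the n-best
    choice = torch.multinomial(probs, num_samples=1)       # draw one candidate per example
    return topk_ids.gather(-1, choice).squeeze(-1)

# Greedy decoding would always pick token 2 here; sampling occasionally
# feeds back token 1 or 7 instead, which keeps the model from locking
# onto only its highest-confidence predictions during training.
logits = torch.tensor([[0.1, 2.0, 4.0, 0.5, 0.2, 0.1, 0.3, 1.5]])
print(sample_from_topk(logits, k=3))
```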
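And here is a minimal sketch of how G2G variants might be used to expand a personalized biasing list before decoding. The `g2g_variants` lookup table is a hand-made stand-in for the trained grapheme-to-grapheme model described in the paper, and `expand_biasing_list` is a name we invented for illustration.

```python
def g2g_variants(name: str) -> list[str]:
    """Stand-in for a trained G2G model: returns alternative spellings
    with similar pronunciations. The entries here are made-up examples."""
    table = {
        "kaity": ["katie", "katy"],
        "jon": ["john"],
    }
    return table.get(name.lower(), [])

def expand_biasing_list(contact_names: list[str]) -> list[str]:
    """Add alternative spellings so the decoder can match any of them."""
    expanded = set()
    for name in contact_names:
        expanded.add(name.lower())
        expanded.update(g2g_variants(name))
    return sorted(expanded)

print(expand_biasing_list(["Kaity", "Jon", "Maria"]))
# ['john', 'jon', 'kaity', 'katie', 'katy', 'maria']
```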
Our Takeaways
End-to-end ASR models can overfit to high-frequency words, making it hard for them to predict rare words. By augmenting the data with G2G and adding a little randomness to the training regime, you can reduce that overfitting and train the model to assign higher probability to low-frequency words like proper nouns.