Are Custom Models More Accurate than General Models?

In the field of ASR, custom models are rarely more accurate than the best general models (learn more about one measure of accuracy, Word Error Rate or WER, here). This is because general models are trained on huge datasets and are continually maintained and updated using the latest deep learning research. For example, at AssemblyAI, we train large deep neural networks on over 12.5 million hours of speech data. This training data is a mix of many different types of audio (broadcast TV recordings, phone calls, Zoom meetings, videos, etc.), accents, and speakers. This massive amount of diverse training data helps our ASR models generalize extremely well across all types of audio, speakers, recording quality, and accents when converting Speech-to-Text in the real world.

Custom models usually come into the mix when dealing with audio data that has unique characteristics unseen by a general model. However, because large, accurate general models see most types of audio data during training, there are not many “unique characteristics” that would trip up a general model, or that a custom model would even be able to learn. To learn more about this topic, see this blog post.
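Since the accuracy comparison above is usually expressed in terms of WER, here is a minimal sketch of how that metric can be computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a model's output, divided by the number of words in the reference. The function name `word_error_rate` and the example transcripts are hypothetical, and the normalization (lowercasing and splitting on whitespace) is a simplifying assumption; production evaluations typically apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    # Simplifying assumption: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


# Hypothetical transcripts: one substituted word out of six.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER: {wer:.3f}")  # prints "WER: 0.167"
```

Comparing a general model and a custom model on the same held-out audio with a function like this is one way to check whether customization actually lowers the error rate for a given use case.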