
Are Custom Models More Accurate than General Models?

In the field of automatic speech recognition (ASR), custom models are rarely more accurate than the best general models (one measure of accuracy is Word Error Rate, or WER). This is because general models are trained on huge datasets and are constantly maintained and updated using the latest deep learning research.
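
To make the WER measure mentioned above concrete, here is a minimal sketch of how it is commonly computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a model's output, divided by the number of words in the reference. The function name `word_error_rate` and the simple lowercase-and-split normalization are assumptions for illustration, not AssemblyAI's evaluation code; real benchmarks typically apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative WER: word-level edit distance / reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, used only to show how two outputs compare.
reference = "the cat sat on the mat"
print(word_error_rate(reference, "the cat sat on the mat"))
print(word_error_rate(reference, "the cat sat on a mat"))
```

A lower WER means fewer word-level errors relative to the reference, so comparing a custom model and a general model on the same reference transcripts is one way to check which is more accurate for a given type of audio.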

For example, at AssemblyAI, we train large deep neural networks on over 12.5 million hours of speech data. This training data is a mix of many different types of audio (broadcast TV recordings, phone calls, Zoom meetings, videos, etc.), accents, and speakers. This massive amount of diverse training data helps our ASR models generalize extremely well across all types of audio, speakers, recording quality, and accents when converting speech to text in the real world.

Custom models usually come into the mix when dealing with audio data that has unique characteristics unseen by a general model. However, because large, accurate general models see most types of audio data during training, there are not many “unique characteristics” that would trip up a general model, or that a custom model would even be able to learn.

To learn more about this topic, see this blog post.
