Significant advancements in Speech AI research are making live Speech-to-Text transcription more accurate than ever before. This has led to growing demand for high-quality AI tools, such as AI voice bots for call centers and voice assistants for customer service, that leverage live Speech-to-Text technology.
AssemblyAI’s Streaming Speech-to-Text model (previously called Real-Time) gives users the same powerful technology under a new name, along with a few improvements:
- More customization and control
- Lower cost to build (originally announced this past January)
These updates make it easier to build next-generation AI tools and products on top of live speech transcription.
Advanced use cases for Streaming Speech-to-Text
Historically, live Speech-to-Text users had access to limited AI technology and were often frustrated by how stilted and unnatural conversations felt when using other tools.
AssemblyAI’s Streaming Speech-to-Text (Streaming STT) offers a best-in-class experience for users who are looking for a seamless option. Streaming STT includes accurate, customizable end-of-utterance detection, which ensures that conversations are transcribed more naturally to enable better AI-human interactions.
Companies are now using Streaming STT for a variety of purposes:
- Live captions for streaming audio and video
- AI voice assistants and voice bots for customer service, call centers and sales applications
- Language learning tools
- Accessibility applications
- Virtual meetings
How to customize end-of-utterance detection
End-of-utterance detection enables the live Speech-to-Text model to identify when a speaker has finished speaking.
With our recent update to Streaming STT, developers can now customize how and when the model decides that a speaker is done talking.
Developers can modify end-of-utterance detection in two ways:
- By adjusting how much silence the model waits for before declaring that the speaker is done speaking. This is accomplished by setting the parameter shown in the commented line below.
import assemblyai as aai

transcriber = aai.RealtimeTranscriber(
    on_data=on_data_callback,
    on_error=on_error_callback,
    sample_rate=sample_rate,
    end_utterance_silence_threshold=300,  # Custom silence threshold (in milliseconds) for end-of-utterance detection
)
transcriber.connect()

audio_stream = ...  # your audio source, e.g. a generator of raw audio chunks
for audio_chunk in audio_stream:
    transcriber.stream(audio_chunk)
- By forcing an end of utterance programmatically. This is accomplished by calling the method shown in the commented line below.
import assemblyai as aai

transcriber = aai.RealtimeTranscriber(
    on_data=on_data_callback,
    on_error=on_error_callback,
    sample_rate=sample_rate,
)
transcriber.connect()

audio_stream = ...  # your audio source, e.g. a generator of raw audio chunks
for audio_chunk in audio_stream:
    transcriber.stream(audio_chunk)

    speaker_changed = ...  # your own signal that the speaker has changed
    if speaker_changed:
        transcriber.force_end_utterance()  # Steers the model to produce a final transcript
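Both snippets assume on_data_callback and on_error_callback have already been defined. As a minimal sketch (the transcript and error types come from the assemblyai Python SDK; the print statements are purely illustrative), the callbacks could look like this:

def on_data_callback(transcript: aai.RealtimeTranscript):
    if not transcript.text:
        return  # ignore empty results
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        # Final transcripts arrive once the model decides the utterance has ended
        print(transcript.text)
    else:
        # Partial transcripts arrive continuously while the speaker is still talking
        print(transcript.text, end="\r")

def on_error_callback(error: aai.RealtimeError):
    print("An error occurred:", error)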
With these new controls, developers can build more natural interactions with their AI tools, giving their users a better overall experience with live Speech-to-Text applications.
Unlock powerful live Speech-to-Text use cases with faster, lower-cost STT
Streaming Speech-to-Text allows users to transcribe live audio streams with high accuracy and low latency at a lower price of $0.47 per hour (reduced from $0.75 per hour), or $0.0001306 per second, of audio data. This includes access to the Streaming Speech-to-Text model, Automatic Punctuation and Casing, and Custom Vocabulary.
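As a quick, purely illustrative back-of-the-envelope check of how the hourly and per-second rates relate (the 250-hour figure below is hypothetical):

HOURLY_RATE = 0.47                    # USD per hour of streamed audio
PER_SECOND_RATE = HOURLY_RATE / 3600  # ≈ $0.0001306 per second

monthly_hours = 250  # hypothetical monthly usage
print(f"Estimated monthly cost: ${monthly_hours * HOURLY_RATE:.2f}")  # $117.50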
To use the service, users stream audio data to our secure WebSocket API and receive transcripts back within a few hundred milliseconds.
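As an example, a minimal end-to-end sketch using the Python SDK and the callbacks shown above might look like the following. It assumes the SDK’s optional extras dependencies are installed (for the MicrophoneStream helper) and that YOUR_API_KEY is replaced with a real key:

import assemblyai as aai

aai.settings.api_key = "YOUR_API_KEY"  # placeholder: use your own API key

transcriber = aai.RealtimeTranscriber(
    on_data=on_data_callback,
    on_error=on_error_callback,
    sample_rate=16_000,
)
transcriber.connect()

# Stream audio from the local microphone until interrupted
microphone_stream = aai.extras.MicrophoneStream(sample_rate=16_000)
transcriber.stream(microphone_stream)

transcriber.close()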
Users can follow the step-by-step instructions in our docs to get started, or use one of our official SDKs.