March 5, 2025
Releases

Raising the bar for Speech AI: Announcing a first-of-its-kind Speech Language Model and an improved Streaming model

Today, we're announcing two upcoming products that will be released over the coming weeks and months and are set to redefine industry standards for Speech AI capabilities.

By 
Dylan Fox
Founder, CEO

I’m excited to share two product announcements today. Both will roll out over the coming weeks and months, and both are set to redefine industry standards for Speech AI capabilities: a first-of-its-kind, promptable Speech Language Model, which we’ve been calling Slam-1, built to optimize accuracy for specific applications and industries, and a brand-new streaming speech-to-text model for Voice Agents with industry-leading latency, accuracy, and endpointing.

The future of Speech AI 

Speech AI adoption is skyrocketing in 2025, fueling everything from AI meeting notes to voice agents to medical scribes and more. We’ve seen usage of our own APIs grow nearly 300% over the past 12 months, yet nearly two-thirds of AI product leaders report they’re still preparing to bring multi-modality and Speech into their AI applications. To put this into perspective, data shows that 70% of call centers plan to incorporate some form of AI into their operations, and 85% of customer service agents plan to pilot Speech-Driven Conversational AI solutions, signaling that we’re still in the very early innings of Speech-Driven AI applications.

We’re seeing huge market acceleration across a number of Speech-Driven AI workflows that span industries including Conversational AI, Voice Agents, Medical, and more.

At AssemblyAI, we’ve had the privilege of working with more than 400K developers and thousands of businesses building innovative Speech-Driven AI applications over the past 3 years – including incredible companies like Fireflies, Siro, CallRail, and Veed. With over 10K customers using our API each month, we’ve been given a front-row seat to where existing Speech AI technology is still bottlenecked in real-world Speech-Driven AI applications.

First, what’s very clear is that the industry does not need more generic Speech AI solutions. Standard industry benchmarks around word error rate (WER) have become saturated and have drifted from what real-world Speech-Driven AI applications and their end-users actually care about. This is why at AssemblyAI we’ve pushed ourselves to go beyond generic accuracy and now use better metrics to guide our R&D – metrics that tell a more complete picture from research to engineering, like subjective quality scores from human raters, A/B testing in real-world applications, and more granular quantitative measurements around rare-word accuracy, formatting, and hallucinations, as well as end-to-end latency and stability – all of which better measure downstream impact than WER scores alone.
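
To make this concrete, here’s a toy illustration – not our actual evaluation harness – of how an aggregate WER can look acceptable while the terms end-users actually care about get lost. The transcript, key terms, and metric names below are made up for the example.

```python
# Toy example: a low-looking aggregate WER can hide misses on the entities that
# matter (drug names, people, account numbers). Illustrative only.

def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    d = [[0] * (len(hypothesis) + 1) for _ in range(len(reference) + 1)]
    for i in range(len(reference) + 1):
        d[i][0] = i
    for j in range(len(hypothesis) + 1):
        d[0][j] = j
    for i in range(1, len(reference) + 1):
        for j in range(1, len(hypothesis) + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(reference)

def entity_miss_rate(entities: list[str], hypothesis: list[str]) -> float:
    """Fraction of key terms that never appear in the transcript."""
    hyp = set(hypothesis)
    return sum(1 for e in entities if e not in hyp) / len(entities)

reference = ("thank you for calling the clinic today the patient was prescribed "
             "metoprolol twice daily and referred to dr patel for a follow up "
             "visit next week").split()
hypothesis = ("thank you for calling the clinic today the patient was prescribed "
              "metro poll twice daily and referred to dr patel for a follow up "
              "visit next week").split()
key_terms = ["metoprolol", "patel"]

print(f"WER: {word_error_rate(reference, hypothesis):.0%}")                 # ~8% overall
print(f"Entity miss rate: {entity_miss_rate(key_terms, hypothesis):.0%}")   # 50% of key terms lost
```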

It’s this focus that has led us to the two major product breakthroughs that I’m excited to announce today.

Introducing the industry’s first production-ready Speech Language Model

Later this month we’ll be rolling out the industry’s first production-ready Speech Language Model, which we’ve been calling Slam-1, enabling rapidly customizable speech-to-text and speech understanding tasks via prompting. With Slam-1, developers and organizations can quickly optimize accuracy and capabilities for their specific application and industry – across industry terminology, formatting, and other dimensions.

To optimize today’s Speech AI models for a specific industry or application, organizations have had to either develop custom models to train these requirements into the model itself – a process which is both cumbersome and expensive – or implement complex post-processing techniques which are brittle and unreliable. But since Slam-1 is a Speech Language Model, it makes rapid customization available through simple prompting.
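
To illustrate the difference, here’s a minimal sketch of what prompt-based customization could look like against a placeholder REST endpoint. The endpoint URL and the `prompt` field are illustrative assumptions, not the final Slam-1 API, which will be documented when the model launches.

```python
# Hypothetical sketch only: the endpoint and "prompt" parameter shown here are
# illustrative assumptions, not the released Slam-1 API. The point is that domain
# customization becomes a request-time string instead of a training run or a
# brittle post-processing pipeline.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder

payload = {
    "audio_url": "https://example.com/cardiology-consult.mp3",  # placeholder audio
    # Request-time customization: terminology and formatting preferences are
    # expressed in plain language rather than baked into model weights.
    "prompt": (
        "This is a cardiology consultation. Expect drug names such as "
        "metoprolol and apixaban, and format all dosages as '50 mg'."
    ),
}

response = requests.post(
    "https://api.example.com/v2/transcript",  # placeholder URL, not the real API
    headers={"authorization": API_KEY},
    json=payload,
)
print(response.json())
```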

To create Slam-1, we train an adapter layer that bridges the gap between our high-performance speech encoder and a pretrained LLM. This adapter effectively projects the acoustic representations to the language model's semantic space. We fine-tune the adapter and the encoder on high-quality data while keeping the LLM weights fixed: this allows us to leverage the inherent LLM capabilities, ensures robustness to hallucinations, and enables context-sensitive transcription. In our preliminary results, prompting has reduced the miss-detection rate of specified named entities by 74% while reducing the medical term miss-detection rate by 59%. To facilitate practical transcription use while ensuring high-quality outputs with minimal hallucinations, we are packaging the model with key features such as speaker diarization and timestamp prediction and gradually exposing its capabilities through our APIs.
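
For readers who want a mental model of that architecture, here’s a simplified PyTorch-style sketch of the adapter pattern described above. The module names, dimensions, and adapter design are assumptions for illustration and are not the production Slam-1 implementation.

```python
# Simplified sketch of a speech-to-LLM adapter: the speech encoder and adapter are
# trainable, the pretrained LLM stays frozen. All names and sizes are assumptions.
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Projects speech-encoder frames into the LLM's embedding space."""
    def __init__(self, encoder_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(encoder_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, acoustic_features: torch.Tensor) -> torch.Tensor:
        # (batch, frames, encoder_dim) -> (batch, frames, llm_dim)
        return self.proj(acoustic_features)

class SpeechLanguageModel(nn.Module):
    def __init__(self, speech_encoder: nn.Module, llm: nn.Module, adapter: SpeechAdapter):
        super().__init__()
        self.speech_encoder = speech_encoder
        self.adapter = adapter
        self.llm = llm
        # Keep the pretrained LLM frozen; only the encoder and adapter are trained.
        for param in self.llm.parameters():
            param.requires_grad = False

    def forward(self, audio: torch.Tensor, prompt_embeddings: torch.Tensor) -> torch.Tensor:
        acoustic = self.speech_encoder(audio)   # (batch, frames, encoder_dim)
        audio_tokens = self.adapter(acoustic)   # projected into the LLM's semantic space
        # The text prompt and the projected audio are consumed as one input sequence.
        inputs = torch.cat([prompt_embeddings, audio_tokens], dim=1)
        # Assumes an HF-style LLM that accepts precomputed input embeddings.
        return self.llm(inputs_embeds=inputs)
```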

Slam-1 is in private beta now with select partners, and is launching in the coming weeks with an initial set of capabilities.

You can sign up here to join the waitlist or to get notified when Slam-1 is generally available.

Over the coming months, we plan to continuously ship new versions of this model with more capabilities like emotion detection, more advanced promptability and better instruction following, and additional deployment options including our API, deployment on your own servers, and select inference partners.

Our New Streaming API

Over the past year, we’ve seen an explosion in organizations building and deploying Voice Agents with our API. This has allowed us not only to get feedback from thousands of customers about what they want to build next on AssemblyAI's API, but also to build new products and features with their specific needs in mind. That's why today we're excited to announce our new Streaming API, launching in private beta at the end of this month and designed specifically for Voice Agent use cases, featuring:

  • Super low latency in both partial and final transcriptions
  • Intelligent endpointing that distinguishes between pauses and utterance completion by looking at both semantic and audio information simultaneously (a simplified sketch follows this list)
  • High accuracy in key areas like spelling out names and rare words
  • Robustness to background speakers and noise in real-world environments
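
Here’s a simplified sketch of the endpointing idea: combine an acoustic signal (how long the speaker has been silent) with a semantic signal (whether the partial transcript reads as a finished thought) instead of relying on a fixed silence timeout alone. The thresholds and the semantic score shown are illustrative assumptions, not our production logic.

```python
# Illustrative endpointing logic (not AssemblyAI's implementation): fuse trailing
# silence with a semantic end-of-utterance signal before ending the user's turn.

def is_end_of_turn(
    trailing_silence_ms: float,
    semantic_completeness: float,  # e.g. a classifier score over the partial transcript, in [0, 1]
    min_silence_ms: float = 200.0,
    max_silence_ms: float = 1200.0,
) -> bool:
    if trailing_silence_ms >= max_silence_ms:
        return True   # a long enough silence ends the turn regardless of semantics
    if trailing_silence_ms < min_silence_ms:
        return False  # too soon to cut in, even if the sentence sounds complete
    # In between, require the utterance to also look semantically complete, so a
    # mid-sentence pause ("my account number is ...") doesn't trigger the agent.
    return semantic_completeness >= 0.85

# A pause after "What's my current balance" ends the turn quickly; the same pause
# after "My account number is" does not.
print(is_end_of_turn(400, semantic_completeness=0.95))  # True
print(is_end_of_turn(400, semantic_completeness=0.20))  # False
```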

To achieve this, we’ve created a specialized streaming version of our industry-leading Universal model that is purpose-built for Voice Agents.

“We're looking forward to AssemblyAI's new streaming STT model which will empower our customers to build voice agents that don't just respond quickly, but actually understand what people are saying.” – Jordan Dearsley, Founder and CEO of Vapi

Our new Streaming API is going into beta later this month. You can sign up for the waitlist here.

Building what’s next with our customers’ help

Our vision at AssemblyAI has always been to build the industry’s leading API platform to transform and understand speech data with AI, enabling developers and organizations to build amazing new products and services for the world. We’ve been blown away by the response to our speech-to-text and speech understanding models, and have loved seeing the incredible products that customers have built. 

As we look to what’s next for Speech AI, we’ve never felt more confident in the opportunity that exists in the market, or that the next generation of great products is being built today by customers like you.

Slam-1 and our new streaming model are the first of several major product updates coming in 2025, and we remain committed to making improvements across our product capabilities and solving the last-mile issues that are most important to your teams. This includes improvements to speaker diarization and speaker identification, continued reductions in latency and more purchasing options, and the ability to run AssemblyAI on-device, in your own data center, or in the cloud, depending on your organization’s needs.

Streaming Speech-to-Text
LLMs