Speech recognition in the browser using Web Speech API

Learn how to set up speech recognition in your browser using the Web Speech API and JavaScript.

Speech recognition has become an increasingly popular feature in modern web applications. With the Web Speech API, developers can easily incorporate speech-to-text functionality into their web projects. This API provides the tools needed to perform real-time transcription directly in the browser, allowing users to control your app with voice commands or simply dictate text.

In this blog post, you’ll learn how to set up speech recognition using the Web Speech API. We’ll create a simple web page that lets users record their speech and converts it into text. Here is a screenshot of the final app:

Final app: Speech Recognition in your browser using the Web Speech API

Before we set up the app, let’s learn about the Web Speech API and how it works.

What is the Web Speech API?

The Web Speech API is a web technology that allows developers to add voice capabilities to their applications. It supports two key functions: speech recognition (turning spoken words into text) and speech synthesis (turning text into spoken words). This enables users to interact with websites using their voice, enhancing accessibility and user experience.

The Web Speech API consists of two parts:

  • SpeechRecognition: Captures audio through the user’s microphone and passes it to a speech recognition engine. In browsers such as Chrome, the audio is sent to a cloud-based engine (Google’s speech recognition service), so recognition there needs an internet connection. The engine processes the speech and returns the transcribed text to the browser while the user is still speaking, which allows for continuous transcription or voice command execution. Here’s a minimal code example of the SpeechRecognition interface:
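// Chrome exposes the constructor with a webkit prefix
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;
// Set up a SpeechRecognition object
const recognition = new SpeechRecognition();
// Handle the result in a callback (onResult is a function you define)
recognition.addEventListener("result", onResult);
// Start and stop recording
recognition.start();
recognition.stop();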
  • SpeechSynthesis: This part of the API takes text provided by the application and converts it into spoken words using the voices available to the browser. The exact voices and languages depend on the user’s device, operating system, and browser. Many voices are synthesized locally without an internet connection, although some browsers also offer remote voices that require one.
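For comparison with the recognition snippet, here is a minimal sketch of the synthesis side. The helper name speak and its optional parameters are assumptions made for illustration (they let the function run outside a browser); only speechSynthesis.speak() and SpeechSynthesisUtterance come from the API itself.

```javascript
// Hypothetical helper: speaks a string using the SpeechSynthesis half of the API.
// The second and third parameters default to the browser globals.
function speak(
  text,
  synth = globalThis.speechSynthesis,
  Utterance = globalThis.SpeechSynthesisUtterance
) {
  // Feature-detect first: not every environment exposes speech synthesis
  if (!synth || typeof Utterance !== "function") {
    return false;
  }
  const utterance = new Utterance(text);
  utterance.lang = "en-US"; // assumption: US English output
  synth.speak(utterance); // queues the utterance for playback
  return true;
}

// In a browser, call it from a user gesture such as a click handler:
// speak("Hello from the Web Speech API");
```

Some browsers only allow speech output after the user has interacted with the page, which is why the usage comment suggests calling it from a click handler.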

The Web Speech API abstracts these complex processes, so developers can easily integrate voice features without needing specialized infrastructure or machine learning expertise.

Prerequisites

Let’s walk through each step of setting up the Web Speech API on a website, and by the end, you’ll have a fully functional speech recognition web app.

To follow along with this guide, you need:

The full code is also available on GitHub here.

Step 1: Set up the Project Structure

First, create a folder for your project, and inside it, add three files:

  • index.html: To define the structure of your web page.
  • speech-api.js: To handle speech recognition using JavaScript.
  • style.css:  To style the web page.

Step 2: Write the HTML File

We’ll start by writing the HTML code that will display the speech recognition UI. The page should contain a button for starting and stopping the recording, and a section for displaying the transcription results.

Add the following code to index.html:

<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Web Speech API example</title>
    <link rel="stylesheet" href="./style.css" />
  </head>
  <body>
      <h1>Web Speech API example</h1>
      <p>Click the button and start speaking</p>
      <button id="recording-button">Start recording</button>
      <div id="transcription-result"></div>
      <p id="error-message" hidden aria-hidden="true">
        Button was removed<br>Your browser doesn't support Speech Recognition with the Web Speech API
      </p>
    <script src="speech-api.js"></script>
  </body>
</html>

This HTML sets up a simple layout with a button that will trigger speech recognition and a div to display the transcription results. The page also contains an error message for browsers that don’t support the Web Speech API. The message is hidden initially and is made visible through JavaScript when needed.

At the bottom of the body, a script tag points to the speech-api.js file, which contains the Web Speech API logic.

Step 3: Implement Speech Recognition API logic

Now, we’ll move on to writing the JavaScript code to handle speech recognition. Create the speech-api.js file and add the following code:

window.addEventListener("DOMContentLoaded", () => {
  const recordingButton = document.getElementById("recording-button");
  const transcriptionResult = document.getElementById("transcription-result");
  let isRecording = false;
  const SpeechRecognition =
    window.SpeechRecognition || window.webkitSpeechRecognition;
  if (typeof SpeechRecognition !== "undefined") {
    const recognition = new SpeechRecognition();
    recognition.continuous = true;
    recognition.interimResults = true;
    const onResult = (event) => {
      transcriptionResult.textContent = "";
      for (const result of event.results) {
        const text = document.createTextNode(result[0].transcript);
        
        const p = document.createElement("p");
        p.appendChild(text);
        if (result.isFinal) {
            p.classList.add("final");
        }
        transcriptionResult.appendChild(p);
      }
    };
    const onClick = (event) => {
      if (isRecording) {
        recognition.stop();
        recordingButton.textContent = "Start recording";
      } else {
        recognition.start();
        recordingButton.textContent = "Stop recording";
      }
      isRecording = !isRecording;
    };
    recognition.addEventListener("result", onResult);
    recordingButton.addEventListener("click", onClick);
  } else {
    recordingButton.remove();
    const message = document.getElementById("error-message");
    message.removeAttribute("hidden");
    message.setAttribute("aria-hidden", "false");
  }
});

Explanation of the JavaScript Code

  1. Checking browser support: We first check whether the SpeechRecognition API is supported by the browser, falling back to the prefixed webkitSpeechRecognition constructor. If neither exists, we remove the recording button and display an error message.
  2. Setting up Speech Recognition:
    • We initialize the SpeechRecognition object and set continuous to true so that the API continuously listens to the user’s speech until it’s manually stopped.
    • interimResults is also set to true so that users can see the transcription update as they speak, instead of seeing text only once a phrase has been finalized.
  3. Handling the speech event: The onResult function runs each time the recognizer returns new or updated results. It clears the transcriptionResult div, then iterates over event.results and appends one paragraph per result, using the top alternative (result[0].transcript). Final results (which the recognizer will no longer revise) are styled differently using the .final class.
  4. Handling the button click: The onClick function toggles the recording state. If speech recognition is active, it will stop the recognition; otherwise, it will start listening for speech.
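To make the structure of event.results more concrete, here is a small sketch that separates finalized text from interim text. The function name splitTranscripts is hypothetical and not part of the tutorial code; it only relies on the length, isFinal, and transcript properties that the results expose.

```javascript
// Hypothetical helper: splits recognition results into final and interim text.
// `results` is expected to look like event.results: an indexable list where
// each entry has an isFinal flag and its top alternative at index 0.
function splitTranscripts(results) {
  let finalText = "";
  let interimText = "";
  for (let i = 0; i < results.length; i++) {
    const result = results[i];
    const transcript = result[0].transcript; // top alternative
    if (result.isFinal) {
      finalText += transcript;
    } else {
      interimText += transcript;
    }
  }
  return { finalText, interimText };
}

// Possible use inside a "result" handler:
// const { finalText, interimText } = splitTranscripts(event.results);
```

An index-based loop is used here so the same function works on plain arrays as well as on the list object the browser provides.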

Step 4: Style the Web Page

Next, let’s add some styles to make the page a bit more visually appealing. Create the style.css file and add the following styles:

html,
body {
  font-family: Arial, sans-serif;
  text-align: center;
}
#transcription-result {
  font-size: 18px;
  color: #5e5e5e;
}
#transcription-result .final {
  color: #000;
}
#error-message {
  color: #ff0000;
}
button {
  font-size: 20px;
  font-weight: 200;
  color: #fff;
  background: #2f2ff2;
  width: 220px;
  border-radius: 20px;
  margin-top: 2em;
  margin-bottom: 2em;
  padding: 1em;
  cursor: pointer;
}
button:hover,
button:focus {
  background: #2f70f2;
}

This CSS makes the button large and easy to click and keeps the transcription clearly visible. Interim results appear in gray, and the .final class switches finalized results to black. Each time the recognizer finalizes a phrase, you’ll notice its gray text turn black.

Step 5: Test the Web App

Once everything is in place, open the index.html file in a browser that supports the Web Speech API (such as Google Chrome). You should see a button labeled "Start recording". When you click it, the browser will prompt you to grant permission to use the microphone.

After you allow the browser access, the app will start transcribing any spoken words into text and display them on the screen. The transcription results will continue to appear until you click the button again to stop recording.
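While testing, you may find that recognition sometimes stops without a click, for example after an error or when the browser ends the session. In that case the button label can fall out of sync with the actual state. One possible way to handle this is to listen for the "end" and "error" events. The sketch below is an assumption-laden illustration: the helper name syncButtonWithRecognition and the shared state object are hypothetical and replace the plain isRecording variable used earlier.

```javascript
// Hypothetical helper: resets the UI whenever recognition stops, whether
// that was triggered by recognition.stop() or by the browser itself.
// `state` is a shared object such as { isRecording: true }.
function syncButtonWithRecognition(recognition, button, state) {
  recognition.addEventListener("end", () => {
    state.isRecording = false;
    button.textContent = "Start recording";
  });
  recognition.addEventListener("error", (event) => {
    // event.error is a short code such as "not-allowed" or "no-speech"
    console.warn("Speech recognition error:", event.error);
  });
}

// Possible use, right after creating the recognition object:
// syncButtonWithRecognition(recognition, recordingButton, state);
```

The click handler would then read and update state.isRecording instead of a local variable, so both code paths agree on whether recording is active.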

Conclusion

You’ve learned what the Web Speech API is and how to use it. With just a few lines of code, you can add speech recognition to your web projects. Check out the official documentation to learn more.

If you’re looking for an alternative with more features and higher transcription accuracy, we also recommend trying out the AssemblyAI JavaScript SDK.

To learn more about how you can analyze audio files with AI and get inspired to build more Speech AI features into your app, check out more of our blog, like this article on Adding Punctuation, Casing, and Formatting to your transcriptions, or this guide on Summarizing audio with LLMs in Node.js.