How to convert speech to text in Java

Converting speech to text in Java can present challenges due to the complexity of audio processing and the need for accurate speech recognition. However, modern libraries and cloud-based APIs have made it easier to implement these features in Java applications. This article focuses on how to convert speech to text using AssemblyAI's Java SDK, a powerful solution for high-accuracy transcription tasks.

For the transcription step in this guide, we’ll use the AssemblyAI API, which offers free access for over 100 hours of transcription and allows for seamless integration through its Java SDK. AssemblyAI simplifies the transcription process, but we’ll also explore other Speech-to-Text options available for Java, giving a comprehensive view of the tools at your disposal.

Set up the AssemblyAI Java SDK

Prerequisites:

Java development environment (e.g., IntelliJ IDEA, Eclipse)
AssemblyAI API key (sign up for an AssemblyAI account to get one)

Installation

To start, add the AssemblyAI Java SDK to your project by including the following dependency in your Maven or Gradle build file, replacing ASSEMBLYAI_SDK_VERSION with the SDK version you want to use:

Maven:

<dependency>
    <groupId>com.assemblyai</groupId>
    <artifactId>assemblyai-java</artifactId>
    <version>ASSEMBLYAI_SDK_VERSION</version>
</dependency>

Gradle:

dependencies {
    implementation 'com.assemblyai:assemblyai-java:ASSEMBLYAI_SDK_VERSION'
}

Speech-to-Text in Java using AssemblyAI's Java SDK

To begin implementing speech-to-text functionality, create a new App.java file in your Java project and insert the following code:

import com.assemblyai.api.AssemblyAI;
import com.assemblyai.api.resources.transcripts.types.*;

public final class App {
    public static void main(String[] args) {
        AssemblyAI client = AssemblyAI.builder()
                .apiKey("YOUR_API_KEY")
                .build();

        var params = TranscriptOptionalParams.builder()
                .speakerLabels(true)
                .build();

        // You can use a local file (this also needs `import java.io.File;`):
        /*
        Transcript transcript = client.transcripts().transcribe(
                new File("./example.mp3"), params);
        */

        // Or use a publicly-accessible URL:
        String audioUrl = "https://assembly.ai/sports_injuries.mp3";
        Transcript transcript = client.transcripts().transcribe(audioUrl, params);

        if (transcript.getStatus().equals(TranscriptStatus.ERROR)) {
            System.err.println(transcript.getError().get());
            System.exit(1);
        }

        System.out.println(transcript.getText().get());

        transcript.getUtterances().get().forEach(utterance ->
                System.out.println("Speaker " + utterance.getSpeaker() + ": " + utterance.getText())
        );
    }
}

Breaking Down the Code

Step 1: Importing Required Classes

import com.assemblyai.api.AssemblyAI;
import com.assemblyai.api.resources.transcripts.types.*;

The first step involves importing the necessary classes from the AssemblyAI Java SDK. These classes handle the client configuration and manage the transcript data types.

Step 2: Building the AssemblyAI Client

AssemblyAI client = AssemblyAI.builder()
        .apiKey("YOUR_API_KEY")
        .build();

This block initializes an instance of the AssemblyAI client. The API key is passed as a parameter to authenticate requests with AssemblyAI's API.
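
Hardcoding the key as a string literal is fine for a quick test, but it is easy to commit by accident. Below is a minimal sketch of one alternative, assuming the key is stored in an environment variable. The variable name ASSEMBLYAI_API_KEY and the ApiKeyLoader class are illustrative choices for this sketch, not part of the SDK.

```java
import java.util.Map;

final class ApiKeyLoader {
    // Hypothetical variable name chosen for this sketch; the SDK does not require it.
    static final String ENV_VAR = "ASSEMBLYAI_API_KEY";

    // Returns the trimmed key, or throws if the variable is missing or blank.
    static String resolveApiKey(Map<String, String> env) {
        String key = env.get(ENV_VAR);
        if (key == null || key.isBlank()) {
            throw new IllegalStateException(
                    "Set the " + ENV_VAR + " environment variable first");
        }
        return key.trim();
    }

    public static void main(String[] args) {
        try {
            String apiKey = resolveApiKey(System.getenv());
            // Avoid printing the key itself; its length is enough to confirm it loaded.
            System.out.println("Loaded an API key of length " + apiKey.length());
        } catch (IllegalStateException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

The returned value can then be passed to .apiKey(...) in place of the "YOUR_API_KEY" literal.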

Step 3: Setting Transcript Parameters

var params = TranscriptOptionalParams.builder()
        .speakerLabels(true)
        .build();

This part of the code defines optional parameters for the transcription. Here, the speakerLabels option is set to true, which enables speaker diarization. This feature differentiates between speakers in the transcript, making it useful for multi-speaker recordings.

Step 4: Transcribing an Audio File

String audioUrl = "https://assembly.ai/sports_injuries.mp3";
Transcript transcript = client.transcripts().transcribe(audioUrl, params);

The code sets a public URL (audioUrl) pointing to an audio file. The transcribe method is then called with the audio URL and the optional parameters (params). This method sends the audio to AssemblyAI for transcription and returns a Transcript object containing the transcribed text and metadata.
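
The full listing shows that transcribe accepts either a public URL or a local File. If the audio location comes from user input, a small check can decide which form to use. The sketch below relies only on the JDK; the AudioSource class and the isHttpUrl method are hypothetical names used for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class AudioSource {
    // True when the input parses as a URI with an http or https scheme.
    // Anything else (relative paths, Windows paths, names with spaces)
    // is treated as a local file path.
    static boolean isHttpUrl(String source) {
        try {
            String scheme = new URI(source).getScheme();
            return "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] inputs = {"https://assembly.ai/sports_injuries.mp3", "./example.mp3"};
        for (String input : inputs) {
            String kind = isHttpUrl(input) ? "remote URL" : "local file";
            System.out.println(input + " -> " + kind);
        }
    }
}
```

Based on the result, the caller would pass either the string itself or a new File(...) to transcribe.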

Step 5: Handling Errors

if (transcript.getStatus().equals(TranscriptStatus.ERROR)) {
    System.err.println(transcript.getError().get());
    System.exit(1);
}

This section checks whether the transcription process encountered any errors. If an error occurs, the program prints the error message and terminates the execution.
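
The .get() calls in this guide indicate that getError() and getText() return java.util.Optional values, and calling get() on an empty Optional throws NoSuchElementException. Below is a defensive sketch using orElse, with plain Optional<String> values standing in for the SDK getters. The OptionalFields class and its fallback messages are assumptions made for this example.

```java
import java.util.Optional;

final class OptionalFields {
    // Falls back to a generic message when no error text is present.
    static String errorMessage(Optional<String> error) {
        return error.orElse("Transcription failed without an error message");
    }

    // Falls back to an empty string when no transcript text is present.
    static String transcriptText(Optional<String> text) {
        return text.orElse("");
    }

    public static void main(String[] args) {
        // Stand-ins for transcript.getError() and transcript.getText().
        Optional<String> error = Optional.empty();
        Optional<String> text = Optional.of("Hello world");

        System.out.println(errorMessage(error));
        System.out.println(transcriptText(text));
    }
}
```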

Step 6: Printing the Transcript

System.out.println(transcript.getText().get());

The transcribed text is retrieved from the Transcript object and printed to the console. This is the most basic usage of the AssemblyAI API to convert speech to text.

Step 7: Printing Speaker Labels (if enabled)

transcript.getUtterances().get().forEach(utterance ->
        System.out.println("Speaker " + utterance.getSpeaker() + ": " + utterance.getText())
);

If speaker labels are enabled, this part of the code iterates through each utterance (a spoken segment) and prints the speaker’s label along with the corresponding text. This allows the transcript to show which speaker said what, enhancing clarity in multi-speaker scenarios.
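
When the speaker-labelled lines need to go somewhere other than the console (a file or an HTTP response, for example), it can help to build them as a single string first. Below is a sketch of that idea, using a small stand-in Utterance class rather than the SDK's own type so the example runs without the dependency. The class and method names here are illustrative.

```java
import java.util.List;
import java.util.stream.Collectors;

final class UtteranceFormatter {
    // Stand-in for the SDK's utterance type; only the two fields used here.
    static final class Utterance {
        final String speaker;
        final String text;

        Utterance(String speaker, String text) {
            this.speaker = speaker;
            this.text = text;
        }
    }

    // Builds one "Speaker X: text" line per utterance, joined by newlines.
    static String format(List<Utterance> utterances) {
        return utterances.stream()
                .map(u -> "Speaker " + u.speaker + ": " + u.text)
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        List<Utterance> sample = List.of(
                new Utterance("A", "How is the knee feeling?"),
                new Utterance("B", "Much better, thanks."));
        System.out.println(format(sample));
    }
}
```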

Step 8: Running the Java Project

Once the code has been implemented, the final step is to run the project. Follow these steps to execute the Java project file:

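Compile the Java file: Open a terminal or command prompt and navigate to the directory where the App.java file is saved. Because App.java depends on the AssemblyAI SDK, the SDK JAR and its dependencies must be on the classpath. The commands below assume they have been copied into a libs directory next to App.java (the directory name is only an example). If you use Maven or Gradle, you can instead run the App class through the build tool or your IDE.

javac -cp "libs/*" App.java

Run the Java file: After the file has compiled successfully, run the project using the following command (on Windows, use ; instead of : as the classpath separator):

java -cp "libs/*:." App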

The program will output the transcribed text directly to the console. If speaker labels were enabled, the output will show which speaker said what, along with the text from the audio.

By following these steps, the Java project should successfully convert the speech in the provided audio file to text using AssemblyAI’s Java SDK.

Speech-to-Text options for Java apps

AssemblyAI is only one of several options to implement Speech-to-Text in Java applications. When selecting a speech recognition library for Java, it’s important to weigh the benefits and limitations of both open-source and cloud-based solutions. Each option has its unique advantages, depending on project requirements.

The sections below compare both approaches and the factors to consider when choosing a speech-to-text option for Java.

Open-Source Speech-to-Text APIs

CMU Sphinx (Sphinx-4): CMU Sphinx is a long-established open-source speech recognition system that runs offline, which can be critical for applications that need privacy or cannot depend on internet connectivity. However, its accuracy tends to be lower than that of modern cloud-based solutions, making it suitable primarily for small projects or those with limited accuracy needs.

Ideal use case: Projects that require offline support without needing real-time or highly accurate transcription.
Installation: Available via Maven or Gradle.

Cloud-based Speech-to-Text APIs

Google Cloud Speech-to-Text: Google Cloud provides a highly accurate and scalable speech-to-text API, supporting multiple languages and dialects. It is particularly well-suited for enterprise-level applications that require scalability and accuracy. However, being a cloud-based service, it requires an active internet connection and a Google Cloud account, which can introduce potential costs depending on usage.

Ideal use case: Large-scale, multilingual applications that prioritize high accuracy.
Installation: Add the Google Cloud Speech-to-Text client library for Java (available via Maven or Gradle) and set up Google Cloud credentials.

AssemblyAI: AssemblyAI stands out for its ease of setup and features such as speaker diarization, sentiment analysis, and real-time transcription. This cloud-based solution offers high accuracy and flexibility, making it suitable for both batch and real-time transcription needs and a strong choice for developers who need advanced features and reliability.

Ideal use case: Large-scale applications that need high-accuracy speech-to-text, advanced features, or scalability.
Installation: Easily integrated with Java projects through the AssemblyAI Java SDK (see the step-by-step guide above).

For help picking the right solution for your project, see the guides How to Choose the Best Speech-to-Text API and The top free Speech-to-Text APIs, AI Models, and Open Source Engines.

Empower Java Applications with AssemblyAI

AssemblyAI provides a robust and scalable solution for speech-to-text in Java. Its ease of integration, along with features like speaker diarization and high accuracy, makes it an ideal choice for developers building transcription features into their applications. Check out the full documentation for additional details on how to implement these features in Java or sign up for an API key to get started.