Analyze Audio from Zoom Calls with AssemblyAI and Node.js

Learn how to analyze audio from Zoom calls using AssemblyAI and Node.js.

Including AI in virtual meetings can significantly enhance efficiency and the user experience. For example, AI speech services can transcribe a meeting and analyze the transcript to surface insights such as a list of action items or a detailed summary.

In this tutorial, you'll learn how to get and parse audio from a Zoom call, transcribe the audio using AssemblyAI, and then process and analyze the transcribed audio using AssemblyAI's LeMUR API and audio intelligence models.

Set up your development environment

To follow this tutorial, you first need to set up your development environment, install the required dependencies, and retrieve and securely store your AssemblyAI API key.

First, create a new project directory and cd into it by running the command below in your terminal:

mkdir assemblyai-zoom && cd assemblyai-zoom

Then, run the command below to initialize npm in your project:

npm init -y

Run the following command to install the dependencies required for your project:

npm install assemblyai dotenv fluent-ffmpeg node-media-server

Here's a short description of each of the packages you just installed:

  • assemblyai: AssemblyAI's Node.js SDK, which makes it easier for you to interact with the AssemblyAI API.
  • dotenv: Dotenv is a Node.js package that loads environment variables into process.env. You'll need this package to manage your environment variables.
  • fluent-ffmpeg: This is a Node.js library that abstracts the complex command-line usage of FFmpeg into a simple-to-use package. You'll need this package to process your audio before sending it to AssemblyAI. Note that it only wraps the FFmpeg binary, so FFmpeg itself must be installed on your system.
  • node-media-server: Node Media Server is a Node.js implementation of several media servers, including RTMP. You'll need this package to create the custom streaming server that receives your RTMP stream.

Next, create a .env file in your project's root directory and add the variable below to it:

ASSEMBLYAI_API_KEY=

In the next section, you'll get your AssemblyAI API key and add it to this .env file.

Get your API key

To get your AssemblyAI API key, you need to have an AssemblyAI account; you can sign up for free if you don't have one.

If you already have an AssemblyAI account, go ahead and log in. On your welcome dashboard, you should see your API key on the right side of your screen.

Click the Copy API key button and paste the API key in your .env file as the value of ASSEMBLYAI_API_KEY.

Get audio from a Zoom call

There are multiple ways to get audio from a Zoom call, but the most practical options are streaming the call to a custom media server using Real-Time Messaging Protocol (RTMP) or using Zoom recordings (cloud or local). An alternative is to build a bot or use a third-party service that joins meetings and records the audio, but those options incur extra costs in the form of engineering effort or a subscription to a premium service.

For this tutorial, you'll stream the audio using RTMP and save the audio locally as an MP3 file. However, depending on your use case in real life, you might want to store the audio on the cloud.

To stream audio from a Zoom call to a custom media server using RTMP, you must have a Pro plan or higher on Zoom. Additionally, you need to enable livestreaming for meetings on your account.

Next, you'll need a custom media server to receive the audio stream from your Zoom call.

Create a custom media server

To create a custom media server, start by making a new src folder in your project's root directory. Then, create a mediaServer.js file in your src folder with the code below:

// mediaServer.js
const NodeMediaServer = require("node-media-server");
const processAudioStream = require("./audioProcessor");

const config = {
  rtmp: {
    port: 1935,
    chunk_size: 60000,
    gop_cache: true,
    ping: 30,
    ping_timeout: 60,
  },
  http: {
    port: 8000,
    allow_origin: "*",
  },
};

const nms = new NodeMediaServer(config);

// Listen for the "prePublish" event to start processing the audio stream
nms.on("prePublish", (id, StreamPath, args) => {
  console.log(`Stream [${id}] is about to be published at path: ${StreamPath}`);
  processAudioStream(StreamPath);
});

nms.run();

This code sets up a Node.js server for a custom streaming service. It starts by importing the NodeMediaServer library and a processAudioStream function (you'll implement this later). Then, it creates a configuration object for your media server.

The config object contains the configuration for your rtmp and http servers.

The configuration for RTMP includes the following:

  • port: This option represents the port on which the RTMP server listens for traffic. 1935 is the default port.
  • chunk_size: This option specifies the size of the data chunks that the server sends over the network, and it's measured in bytes. In this configuration, the chunk size is set to 60,000 bytes.
  • gop_cache: Setting gop_cache to true enables the server to cache each stream's latest group of pictures (GOP).
  • ping: This option specifies the interval, in seconds, at which the server sends RTMP ping messages to connected clients to keep the connection alive and measure the round-trip time. In this configuration, the server will send a ping every thirty seconds.
  • ping_timeout: This is the timeout (in seconds) for RTMP ping responses. The server closes the connection if the client doesn't respond to a ping message within this time. In this configuration, the timeout is set to sixty seconds.

The configuration for the HTTP server includes the following:

  • port: This is the port number on which the HTTP server will listen for traffic. In this configuration, the port is set to 8000.
  • allow_origin: This setting specifies which origins can make cross-origin requests to the server. The wildcard * is used in this configuration, meaning the server will accept cross-origin requests from any origin.

After setting up the configuration object, it creates an instance of NodeMediaServer and listens for the prePublish event. On the prePublish event, it calls the processAudioStream function and passes StreamPath as an argument.

You've now set up your custom streaming service. In the next step, you'll use FFmpeg to parse the audio from the RTMP stream.

Parse audio from an RTMP stream

RTMP streams typically contain both audio and video data multiplexed together. To use the audio data separately, you need to extract it from the combined stream and store it. For this tutorial, you'll store the extracted audio in an MP3 file.

To ensure that each audio file has a unique name, you can create a utility function that combines the current date (YYYYMMDD) with the current Unix timestamp to generate unique file names.

Create a utils.js file in your src folder and add the following code to it to implement the logic above:

// utils.js
// Returns a unique identifier made of the current date (YYYYMMDD) and a millisecond timestamp
function getFormattedDateTime() {
  const now = new Date();
  const year = now.getFullYear();
  const month = String(now.getMonth() + 1).padStart(2, "0");
  const day = String(now.getDate()).padStart(2, "0");
  const timestamp = Date.now();
  return `${year}${month}${day}_${timestamp}`;
}

module.exports = { getFormattedDateTime };
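
As a quick optional check, you can run this helper by itself; the digits will differ each time, but the shape matches the recording file names used later in this tutorial:

const { getFormattedDateTime } = require("./utils");

console.log(getFormattedDateTime()); // e.g. 20240603_1717405934474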

Next, create an audioProcessor.js file in your src folder and add the code block below to implement the logic for parsing the audio and saving the parsed audio to an MP3 file:

// audioProcessor.js
const fs = require("fs");
const ffmpeg = require("fluent-ffmpeg");
const { getFormattedDateTime } = require("./utils");

function processAudioStream(streamPath) {
  const inputPath = `rtmp://localhost:1935${streamPath}`;
  const outputDir = "./src/recordings";
  const outputPath = `${outputDir}/meeting_${getFormattedDateTime()}.mp3`;

  // FFmpeg won't create the output directory for you, so make sure it exists
  fs.mkdirSync(outputDir, { recursive: true });

  ffmpeg(inputPath)
    .outputOptions("-q:a 0") // Set the audio quality to the highest quality
    .outputOptions("-map a") // Map only the audio streams
    .on("start", (commandLine) => {
      console.log("Spawned FFmpeg with command: " + commandLine);
    })
    .on("progress", (progress) => {
      console.log("Processing: " + progress.timemark + "...");
    })
    .on("error", (err, stdout, stderr) => {
      console.log("An error occurred: " + err.message);
      console.log("FFmpeg stderr: " + stderr);
    })
    .on("end", () => {
      console.log("Processing finished!");
    })
    .save(outputPath);
}

module.exports = processAudioStream;

The code above listens to an RTMP stream, extracts the audio, and saves it as an MP3 file with high quality. It also provides detailed logs for the FFmpeg start, progress, error, and end events.

Finally, start your server by running the command below:

node src/mediaServer

With this setup, once your media server receives an RTMP stream and emits the prePublish event, it will parse the audio from the stream and save it to an MP3 file.
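
If you want to verify the pipeline before involving Zoom at all, you can push a test stream to the server with the FFmpeg command-line tool. This step is optional, and sample.mp4 is a placeholder for any local video file that has an audio track:

ffmpeg -re -i sample.mp4 -c:v libx264 -c:a aac -f flv rtmp://localhost:1935/live/test

While the test stream is running, you should see the prePublish log in your media server's terminal, and an MP3 file should be written to src/recordings.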

Make your custom media server accessible with ngrok

Currently, your media server is hosted on localhost. You need to make it accessible to Zoom using ngrok.

To continue, you need to install ngrok and set it up on your operating system. Then, run the command below to get the location of your ngrok configuration file:

ngrok config check

cd into the directory returned by the command above, open your ngrok.yml file, and add the configuration below to tunnel your RTMP (TCP) and HTTP ports:

tunnels:
  tcp_tunnel:
    proto: tcp
    addr: 1935
  http_tunnel:
    proto: http
    addr: 8000

You can now run ngrok with the command below:

ngrok start --all

Running ngrok prints output in your terminal that lists your forwarding URLs. Save these URLs; in this example, they are tcp://5.tcp.eu.ngrok.io:13156 (this will be your streaming URL) and https://c9f0-105-114-3-2.ngrok-free.app (this will be your livestreaming page URL).

Stream audio to your custom live streaming service

In an ongoing Zoom call, click the three dots on the bottom right of your screen and select Live on Custom Live Streaming Service.

You'll be presented with a form to fill in your livestreaming server's information.

The fields in the form include:

  • Streaming URL: The base URL where your stream should be transmitted, which is the forwarding URL for your RTMP server you saved earlier. Change the protocol in the URL from tcp to rtmp. Then, append /live to the end of the URL to specify the route where the stream is directed—for example, rtmp://5.tcp.eu.ngrok.io:13156/live.
  • Streaming key: A unique identifier the streaming server uses to recognize and manage incoming streams. It acts as a password or token that authorizes the stream being sent to the server. You can fill the streaming key with any value of your choice, such as ZOOM.
  • Live streaming page URL: The URL where you can view your stream. In this case, it's your HTTP forwarding URL with the route /admin/streams appended to it—for example, https://c9f0-105-114-3-2.ngrok-free.app/admin/streams.

Fill out the fields with their appropriate values and click the Go Live! button. Zoom then redirects you to your custom streaming server's livestreaming page.

Your custom streaming service is now recording your Zoom call audio and will save it to an MP3 file once your call ends or you end the livestream.

Transcribe audio with AssemblyAI

Now that you have your Zoom audio, you can get the transcript using AssemblyAI. To achieve this, you need to initialize the AssemblyAI SDK using your API key.

Create an assemblyai.js file in your src folder and add the code block below to initialize your SDK:

require("dotenv").config();
const { AssemblyAI } = require("assemblyai");

// Create a new AssemblyAI client
const client = new AssemblyAI({
  apiKey: `${process.env.ASSEMBLYAI_API_KEY}`,
});

module.exports = client;

To transcribe a local audio file using AssemblyAI, call the transcribe method on client.transcripts and pass the file path as an argument.

Here's a standalone example:

const client = require("./assemblyai");

const transcribeAudio = async (filePath) => {
  const transcript = await client.transcripts.transcribe({ audio: filePath });

  console.log(transcript.text);

  return transcript;
};

transcribeAudio("src/recordings/FILE_PATH_TO_AUDIO");

In this code, the transcribeAudio function takes a file path as an argument and returns a transcript. If your audio file is stored on the cloud, you can pass the audio URL as an argument to transcribe it. Running the code will return a full transcript of the audio whose location you passed as an argument.
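
For instance, transcribing a remotely hosted file uses the same call with only the argument changed (the URL below is just a placeholder):

transcribeAudio("https://example.com/recordings/zoom-meeting.mp3");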

Analyze audio with AssemblyAI's large language and audio intelligence models

AssemblyAI provides an API called LeMUR, which allows you to apply large language models (LLMs) to spoken data. LeMUR enables you to perform various tasks on your audio data, such as asking questions about the data, summarizing the audio data, generating content based on the audio data, and much more.

The following example uses a recorded meeting about the "paradox of poverty." You can use LeMUR to summarize it, and you can also add a context option to provide information not explicitly referenced in the audio data and specify an answer format:

const client = require("./assemblyai");

const summarizeAudioWithLeMUR = async (filePath) => {
  const transcript = await client.transcripts.transcribe({ audio: filePath });

  const { response } = await client.lemur.summary({
    transcript_ids: [transcript.id],
    context: "A talk on the paradox of poverty",
    answer_format: "bullet points",
  });

  console.log(response);

  return response;
};

summarizeAudioWithLeMUR("src/recordings/meeting_20240603_1717405934474.mp3");

The code block above will return the summary of the audio file in bullet points.

You can also send any prompt of your choice to the LLM and apply the model to your transcribed audio using the LeMUR task endpoint.

You can send prompts to ask specific questions about your audio data, like "How many times was the word 'poverty' used in the meeting?" or "List the examples of welfare traps."

For example, the code block below demonstrates how you can use a prompt ("What is the main idea of the talk?") to query your audio data to get the main idea of the audio:

const client = require("./assemblyai");

const analyzeAudioWithLemurTask = async (filePath, prompt) => {
  const transcript = await client.transcripts.transcribe({ audio: filePath });

  const { response } = await client.lemur.task({
    transcript_ids: [transcript.id],
    prompt,
  });

  console.log(response);

  return response;
};

analyzeAudioWithLemurTask(
  "src/recordings/meeting_20240603_1717405934474.mp3",
  "What is the main idea of the talk?"
);

The code block above will return a response detailing the main idea of the audio passed as an argument.

Audio intelligence models

AssemblyAI also provides various audio intelligence models that can perform multiple tasks such as content moderation, personally identifiable information (PII) redaction, and sentiment analysis, among others.

Let's take PII redaction as an example: it minimizes the exposure of sensitive information about individuals in your transcript by automatically identifying and removing it. To enable PII redaction, you need to add a few configuration parameters. The following code block is a modified version of the transcribeAudio function with the required parameters:

const transcribeAudioWithPIIRedaction = async (filePath) => {
  const transcript = await client.transcripts.transcribe({
    audio: filePath,
    redact_pii: true,
    redact_pii_policies: [
      "banking_information",
      "phone_number",
      "email_address",
    ],
    redact_pii_sub: "hash",
  });

  console.log(transcript.text);

  return transcript;
};

transcribeAudioWithPIIRedaction(
  "src/recordings/meeting_20240603_1717405934474.mp3"
);

The parameters you added here are as follows:

  • redact_pii: This is a Boolean value that, when set to true, enables PII redaction in the transcript.
  • redact_pii_policies: This is an array of the PII types to redact. In the example above, those are banking information, phone numbers, and email addresses.
  • redact_pii_sub: This refers to what the redacted information should be replaced with. In the example above, the redacted information will be replaced with hashes (###).

The code block will return a transcript with all the specified PII redacted.
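
Sentiment analysis works in a similar way: you enable the model with a single configuration parameter, and the results are attached to the returned transcript. The sketch below reuses the client module and recording path from the earlier examples; the function name is just illustrative:

const client = require("./assemblyai");

const transcribeAudioWithSentiment = async (filePath) => {
  const transcript = await client.transcripts.transcribe({
    audio: filePath,
    sentiment_analysis: true, // Enable the sentiment analysis model
  });

  // Each result includes the spoken text, its sentiment (POSITIVE, NEUTRAL, or NEGATIVE), and a confidence score
  for (const result of transcript.sentiment_analysis_results) {
    console.log(`${result.sentiment} (${result.confidence.toFixed(2)}): ${result.text}`);
  }

  return transcript;
};

transcribeAudioWithSentiment(
  "src/recordings/meeting_20240603_1717405934474.mp3"
);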

You can find the code used in this tutorial in this GitHub repository.

Conclusion

In this article, you learned how to extract audio data from a Zoom call using RTMP, Node Media Server, and FFmpeg. You also learned how to transcribe the audio using AssemblyAI's transcription models, apply LLMs to it using LeMUR, and manipulate the audio transcriptions using audio intelligence models.