In a world where data is a crucial asset for training AI models, we've seen firsthand at AssemblyAI how properly managing this vital resource is essential in making progress toward our goal of democratizing state-of-the-art Speech AI. In the course of developing our Conformer and Universal speech recognition models, we've had to navigate the complexities of handling massive amounts of audio data and metadata. As our data needs grew, so too did the accompanying challenges, such as fragmentation, bottlenecks, and limited accessibility.
That's where our AI Lakehouse comes in. This project is more than just a technical upgrade; it's a game-changer for how we handle data. By unifying our datasets and empowering teams to take charge of their data's quality, our AI Lakehouse is turning the challenges associated with managing large volumes of data into opportunities for our entire organization to effectively leverage this data.
In this post, you'll see under the hood of our AI Lakehouse project, discovering how we've transformed our data infrastructure into a strategic asset and set the stage for our next breakthroughs.
Navigating the Challenges of Rapid Data Growth
At AssemblyAI, we handle massive amounts of unstructured data, primarily in the form of audio, as well as structured data like labels and metadata. Over the past few years, we have significantly scaled our data operations. For example, our latest Universal-2 speech recognition model was trained on more than 12 times as much data as our speech recognition model from two years ago.
This rapid growth in data volume has introduced a new set of challenges, from managing organizational overhead to ensuring the utility and accessibility of our vast datasets. In addition to these volume-related challenges, the expansion of our research and engineering teams has led to increased requirements for data governance, quality control, and discoverability. Each team has its own unique data needs and workflows, contributing to an increasingly complex and diverse data ecosystem.
The Role of Data Infrastructure
Our Data Infrastructure team plays a crucial role in this ecosystem and is responsible for building the infrastructure and tools that empower AssemblyAI to leverage data effectively—whether it be for training, benchmarking, or other data-driven operations critical to business decisions. Until earlier this year, all requests for acquiring datasets and storing them in an accessible format had been routed through the Data Infrastructure team, making them a central point of contact for data-related needs across our organization.
The Problem
Different teams across our organization—both technical and non-technical—often have unique data requirements that require specific subsets of audio data combined with various classes of metadata.
Prior to the introduction of our AI Lakehouse, the absence of an established self-serve process for data consolidation led various teams to piece together their own data solutions. This resulted in significant data fragmentation, making it difficult to navigate and leverage our data assets effectively. To address this issue, the Data Infrastructure team stepped in to consolidate the datasets through a centralized ingestion and review process. While this approach provided some improvements by at least centralizing dataset ingestion, it ultimately created a bottleneck for all other teams, as they had to rely on the Data Infrastructure team for their data requirements.
The Data Infrastructure team was faced with the following key challenges:
- Limited Access and Bottleneck Effect: Access to datasets was restricted, and all requests had to go through the Data Infrastructure team. This created a bottleneck as the team became inundated with ad hoc requests from other teams, slowing down the team’s ability to develop the platform that would ultimately resolve these issues. Urgent requests took precedence over long-term solutions, perpetuating and compounding the problem.
- Data Fragmentation, Duplication, and Limited Query Capabilities: As the organization grew, it became increasingly difficult to locate datasets within the fragmented ecosystem. This often led to the duplication of datasets across our Google Cloud Platform (GCP) project due to a lack of proper data governance. Moreover, the ability to query across all datasets simultaneously and have visibility into the entirety of our data was severely limited, slowing down the research and development process.
- Lack of Dataset Utility Awareness: We lacked a well-defined process for assessing the utility and relevance of different datasets, making it challenging to prioritize and manage them effectively.
Enabling Organizational Efficiency: Enter the AI Lakehouse
To address these challenges, we introduced our internal AI Lakehouse solution. The primary goals of this project were threefold:
- Democratize Data Access with Security and Compliance: We sought to make data readily accessible across the organization, allowing teams to find and utilize the datasets they need without unnecessary roadblocks. At the same time, we had to ensure that this increased accessibility did not compromise security or compliance. Our approach includes robust access controls and adherence to strict data governance policies, ensuring that, while data is more readily available, it remains secure and compliant with our security policies.
- Consolidation of Datasets Across the Organization in a High-Quality Manner: Consolidating datasets was a crucial step forward, but consolidation alone is not sufficient. To truly maximize dataset utility, consolidation must be paired with organization and quality filtering. Consolidation on its own is analogous to gathering clothes strewn about the house and stuffing them into a single drawer: the clothes are in one place, but little is gained until they are folded and stored with some organizational structure. By folding and organizing shirts on one shelf and jeans on another, the utility of the wardrobe is greatly enhanced. Similarly, our goal is to ensure that consolidated datasets are well-organized and of high quality, thereby increasing their utility across the organization.
- Shift Dataset Quality Responsibility to the Requester: We are transitioning the responsibility for organizing and managing datasets from the Data Infrastructure team to the dataset requesters themselves. In the past, datasets often came with poor or incomplete metadata, requiring the Data Infrastructure team to clean and enhance the data to ensure its utility. By shifting this responsibility to the requesters, we not only reduce the bottleneck but also encourage teams to take ownership of the quality and usability of their data. This approach fosters a more efficient and scalable data management process, where data is delivered in a ready-to-use state, tailored to the specific needs of the requesters.
High-Level Architecture Diagram
Figure 1. High-Level Architecture Diagram showcasing our Storage Layer and our Cache Layer.
Our AI Lakehouse is designed to efficiently manage, store, and serve large volumes of data, offering fast access and robust analytics capabilities. It not only supports application use cases like model training and benchmarking but also facilitates dataset discovery and the execution of analytical queries across all datasets.
Storage Layer
For blob storage (mainly for audio files), we opted for Google Cloud Storage (GCS) due to its favorable cost-to-performance ratio, which made sense for our needs. Regarding metadata storage, we chose Bigtable for several key reasons. First, many of our datasets lack fixed schemas, and often, different datasets have unique columns, so a key-value store like Bigtable was a natural fit. Our previous experience with Bigtable for serving training workloads demonstrated that it was both fast and reliable.
We also appreciated the ability to control the costs associated with serving data from Bigtable to applications, avoiding the pay-per-byte model of services like BigQuery, where costs could escalate with data scans. Bigtable's setup process was straightforward, and it offered additional benefits that met our requirements. For instance, Bigtable’s support for multiple or historical values per cell was crucial for us, and the concept of column families allowed us to implement customized garbage collection policies. Another advantage was the ability to create authorized views, with Bigtable managing these securely on our behalf.
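To make these points concrete, here is a minimal sketch of the Bigtable features mentioned above: schemaless per-dataset columns, per-column-family garbage collection, and versioned cells. The project, instance, table, and column names are placeholders rather than our production schema.

```python
import datetime

from google.cloud import bigtable
from google.cloud.bigtable import column_family

client = bigtable.Client(project="my-gcp-project", admin=True)
instance = client.instance("lakehouse-metadata")
table = instance.table("dataset_metadata")

if not table.exists():
    table.create(column_families={
        # Keep only the latest value for frequently rewritten metadata.
        "meta": column_family.MaxVersionsGCRule(1),
        # Keep a year of history for fields where older versions matter.
        "history": column_family.MaxAgeGCRule(datetime.timedelta(days=365)),
    })

# Rows from different datasets can carry different columns; no fixed schema.
row = table.direct_row(b"dataset_x#utt_000123")
row.set_cell("meta", b"language", b"en")
row.set_cell("meta", b"duration_seconds", b"12.7")
row.set_cell("meta", b"audio_uri", b"gs://example-bucket/dataset_x/utt_000123.wav")
row.commit()
```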
Cache Layer
Nearly all of our teams are well-versed with the BigQuery interface because of its ease of use and powerful performance when executing SQL queries. We recognized that by keeping BigQuery as our primary interface, we would significantly increase adoption speed. This approach meant that instead of asking our teams to learn a new tool, we could enhance their existing experience by consolidating all datasets into a unified backend with which they were already familiar.
However, a challenge arose when we had to figure out the best way to integrate our metadata from Bigtable into BigQuery. We encountered several competing ideas, each with its own merits. The goal was to leverage BigQuery's familiar interface while ensuring seamless access to our newly consolidated datasets, with Bigtable as the metadata backend, ultimately enhancing both usability and performance.
Our initial three main contenders for integrating metadata into BigQuery were:
- Dual Write to Bigtable and BigQuery
- Use BigQuery External Table on Top of Bigtable
- Use Bigtable CDC (Change Data Capture) Stream to BigQuery
Dual Write to Bigtable and BigQuery
This approach involved writing data simultaneously to both Bigtable and BigQuery. The advantage here was the immediate availability of data in both storage layers, allowing for real-time access and analytics without additional processing steps. However, managing consistency between the two systems posed a potential challenge and risk.
Use BigQuery External Table on Top of Bigtable
In this scenario, we considered creating BigQuery External Tables that would directly interface with the data stored in Bigtable. This method offered a seamless way to query data without duplicating storage. However, we quickly identified some significant concerns, particularly around performance. Since the data would be accessed across different storage systems, complex queries could suffer from slower response times, which, in turn, could reduce user adoption if the experience didn't meet expectations.
Additionally, we observed some quirks (although reasonable in hindsight) when querying external tables in Bigtable. Unlike native BigQuery tables, Bigtable—being a key-value store—may not return rows for keys that lack corresponding values, which can lead to differences in query results. These inconsistencies further complicated our decision-making process, as the reliability and predictability of query results are critical for user confidence and system adoption.
Use Bigtable CDC (Change Data Capture) Stream to BigQuery
This approach would allow us to stream changes from Bigtable to BigQuery in real-time, ensuring that BigQuery always had the most up-to-date data. The dynamic, near-instantaneous data syncing was particularly appealing, as it promised a highly responsive and efficient system. However, as we delved deeper, it became clear that this solution might be overpowered for our needs. We weren’t particularly interested in real-time updates for the cache layer; a refresh every 24 hours would have sufficed for our use case.
Additionally, this approach came with significant overhead. We would have had to manage the changelog table ourselves, as the CDC (Change Data Capture) stream emits an event for every cell change. This would require us to pivot the table, handle deletes, and manage the complexity of these updates—an effort that seemed disproportionate to our actual requirements. Given our goal of maintaining a simple and efficient cache layer, the management overhead associated with this solution appeared unnecessary and more complex than what we needed to achieve.
The Solution We Chose
In the end, we opted for a fourth alternative solution that is straightforward yet effective: leveraging BigQuery Scheduled Queries to create a BigQuery native table from the Storage Layer every 24 hours. This table includes only the dataset fields that are most relevant for analysis, such as language and duration.
This approach significantly reduced our operational overhead while still meeting our performance expectations with BigQuery. It also allowed us to maintain cost efficiency by avoiding the complexities and expenses associated with real-time updates or managing a changelog table. By focusing on the essential data and updating it daily, we struck a balance between simplicity, performance, and cost-effectiveness, providing our teams with the reliable, high-performance, familiar access they needed for their analyses while still retaining heavy-duty, detailed tables should the need for higher resolution queries arise.
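For illustration, here is a minimal sketch of how such a scheduled query could be registered with the BigQuery Data Transfer Service. The project, dataset, table, and column names are placeholders, and the actual refresh query is more involved than this SELECT.

```python
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Hypothetical refresh query: rebuild a native cache table with only the
# analysis-relevant fields (e.g., language and duration).
refresh_sql = """
CREATE OR REPLACE TABLE `ai_lakehouse.dataset_metadata_cache` AS
SELECT sample_id, dataset_name, language, duration_seconds
FROM `ai_lakehouse.dataset_metadata_source`
"""

transfer_config = bigquery_datatransfer.TransferConfig(
    display_name="Refresh AI Lakehouse metadata cache",
    data_source_id="scheduled_query",  # built-in source for scheduled queries
    params={"query": refresh_sql},
    schedule="every 24 hours",
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path("my-gcp-project"),
    transfer_config=transfer_config,
)
print(f"Created scheduled query: {transfer_config.name}")
```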
A Highlight in Simplicity: The Looker Dashboard
After we invested significant time and effort in designing a robust system architecture and ensuring top-tier security, it was somewhat surprising to see what garnered the most attention within the organization: a Looker dashboard. While we dedicated extensive resources to developing a complex and secure backend, it was this dashboard—created in under ten minutes—that became the most visible and celebrated element of the team’s work.
We integrated the dashboard into our internal systems and it quickly became one of the most frequently accessed resources within the company. Despite its simplicity, the dashboard effectively encapsulated the essence of our project and resonated strongly with users across the organization. This experience serves as a reminder that sometimes, the most straightforward solutions can have the most significant impact.
Figure 2. AssemblyAI Lakehouse Looker Dashboard. The numbers in these charts are randomly generated for visualization purposes.
Dataset Ingestion Process: How Data Enters the AI Lakehouse
Figure 3. Illustration of the dataset ingestion process graph
Another critical aspect of our AI Lakehouse is the ingestion process—how data actually makes its way into the AI Lakehouse. We’ve designed this process to be efficient, scalable, and user-friendly, allowing teams to easily contribute new datasets while maintaining the integrity and quality of our data environment.
The Ingestion Workflow
The process begins with both the data blobs and their associated metadata being placed in a staging area. From there, we’ve developed a lightweight data flow pipeline—built using Google’s Dataflow—to handle the initial processing steps. This pipeline performs a few key tasks: computing some basic metrics, validating the data, formatting it correctly, and routing each data artifact to its intended final destination within the AI Lakehouse.
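The sketch below illustrates what these processing steps might look like as an Apache Beam pipeline run on Dataflow; the bucket paths, field names, validation rules, and Bigtable destination are all placeholders rather than our actual pipeline.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from google.cloud.bigtable.row import DirectRow

REQUIRED_FIELDS = {"audio_uri", "language", "duration_seconds"}

def is_valid(record: dict) -> bool:
    # Basic validation: reject records missing required metadata fields.
    return REQUIRED_FIELDS.issubset(record)

def compute_metrics(record: dict) -> dict:
    # Example of a cheap derived metric; a real pipeline might probe the audio itself.
    record["transcript_chars"] = len(record.get("transcript", ""))
    return record

def to_bigtable_row(record: dict) -> DirectRow:
    # Format each record as a Bigtable mutation keyed by its audio URI.
    row = DirectRow(row_key=record["audio_uri"].encode())
    for key, value in record.items():
        row.set_cell("meta", key.encode(), str(value).encode())
    return row

def run():
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-gcp-project",
        region="us-central1",
        temp_location="gs://example-staging-bucket/tmp",
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadStagedMetadata" >> beam.io.ReadFromText(
                "gs://example-staging-bucket/dataset_x/metadata/*.jsonl")
            | "Parse" >> beam.Map(json.loads)
            | "Validate" >> beam.Filter(is_valid)
            | "ComputeMetrics" >> beam.Map(compute_metrics)
            | "FormatRows" >> beam.Map(to_bigtable_row)
            | "WriteMetadata" >> WriteToBigTable(
                project_id="my-gcp-project",
                instance_id="lakehouse-metadata",
                table_id="dataset_metadata",
            )
        )

if __name__ == "__main__":
    run()
```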
Shifting Responsibility to the Requestor
One of our main goals with this ingestion pipeline was to shift the cost and responsibility of ingesting new datasets from our central Data Infrastructure team to the requestors themselves. However, during internal testing, we realized that fully exposing this process to the organization presented an opportunity to implement a review process similar to code reviews in software development. By consciously adding this layer of oversight, we can ensure data integrity, maintain quality standards, and prevent potential issues that could arise from unchecked data modifications.
The Review Process
To implement this, we created a Dataset Review Dashboard. Here’s how it works:
- Requestor Documentation: The person or team requesting the ingestion of a new dataset is provided with lightweight documentation on how to format their dataset and add it to the staging area.
- Composer DAG Execution: Once the dataset is in the staging location, a Composer Directed Acyclic Graph (DAG) is triggered by the requestor, carrying information about the dataset throughout the DAG workflow nodes:
- Review Request Submission: A DAG node submits a request to our Dataset Review Service, prompting a reviewer to assess the dataset.
- Reviewer Notification: A reviewer is notified, and the node continues to poll the status of the review request until it reaches a terminal state (approved or rejected).
- Feedback Mechanism: If the dataset is rejected, the reviewer provides feedback on why it didn’t meet the criteria, similar to how code reviews focus on quality and style, and the requestor is notified of the requested changes.
- Approval and Ingestion: Once approved, the Ingestion Dataflow Pipeline is triggered, and the dataset is officially ingested into our AI Lakehouse.
This review process ensures that all datasets entering the AI Lakehouse meet our standards for quality and format, helping to maintain the integrity of our data ecosystem.
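To make the flow above concrete, here is a condensed, hypothetical sketch of what such a Composer (Airflow) DAG could look like. The review-service endpoint, task names, and payloads are illustrative, not our actual implementation.

```python
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.python import PythonSensor

REVIEW_SERVICE = "https://dataset-review.internal"  # placeholder endpoint

def submit_review_request(dataset: str) -> str:
    # Ask the Dataset Review Service to open a review; return its id for polling.
    resp = requests.post(f"{REVIEW_SERVICE}/reviews", json={"dataset": dataset})
    resp.raise_for_status()
    return resp.json()["review_id"]

def review_is_approved(review_id: str) -> bool:
    # Poll until a terminal state: fail fast on rejection, succeed on approval.
    status = requests.get(f"{REVIEW_SERVICE}/reviews/{review_id}").json()["status"]
    if status == "rejected":
        raise ValueError("Dataset rejected; see reviewer feedback.")
    return status == "approved"

with DAG(
    dag_id="dataset_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,  # triggered manually by the requestor
    catchup=False,
) as dag:
    submit = PythonOperator(
        task_id="submit_review_request",
        python_callable=submit_review_request,
        op_kwargs={"dataset": "{{ dag_run.conf['dataset_name'] }}"},
    )
    wait_for_approval = PythonSensor(
        task_id="wait_for_approval",
        python_callable=review_is_approved,
        op_kwargs={"review_id": "{{ ti.xcom_pull(task_ids='submit_review_request') }}"},
        poke_interval=300,
        mode="reschedule",  # release the worker slot between polls
    )
    trigger_ingestion = PythonOperator(
        task_id="trigger_ingestion_pipeline",
        python_callable=lambda: print("launch the Dataflow ingestion job here"),
    )

    submit >> wait_for_approval >> trigger_ingestion
```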
Measuring Success: How Do We Know It Worked?
When we set out to build the AI Lakehouse, we knew that success would be measured by how well it served our teams and simplified our data operations. To gauge its impact, we focused on a few key indicators that would tell us if we were truly making a difference.
1. Fewer ad hoc Support Requests
Since implementing the AI Lakehouse, we have observed a dramatic drop in ad hoc data-related support requests; for datasets that have been fully migrated, such requests have virtually disappeared. This implies that our teams are finding what they need without having to ask for help, which is exactly what we hoped to achieve.
2. More Activity on the Cache Layer
Another big win has been seeing the increase in query activity on our BigQuery Cache Layer. We’ve been using BigQuery Job History to track how often and how heavily the cache is being used. The uptick in usage shows that teams are relying on the Cache Layer for their analytical work, which means it’s performing well and meeting their needs.
Figure 4. Table showing users' engagement with the AI Lakehouse in terms of query count
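One way to produce this kind of usage view is to query BigQuery's job metadata, as in the sketch below; the project, region, and Cache Layer dataset name are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

# Count queries per user over the last 30 days that touched the Cache Layer dataset.
usage_sql = """
SELECT
  user_email,
  COUNT(*) AS query_count
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND EXISTS (
    SELECT 1
    FROM UNNEST(referenced_tables) AS t
    WHERE t.dataset_id = 'ai_lakehouse'  -- placeholder Cache Layer dataset
  )
GROUP BY user_email
ORDER BY query_count DESC
"""

for row in client.query(usage_sql).result():
    print(f"{row.user_email}: {row.query_count} queries")
```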
3. Saying Goodbye to Legacy Systems
One of the most tangible outcomes has been getting rid of some old, cumbersome systems—specifically, the CSVs we used to rely on for benchmarking datasets. Migrating these datasets to the AI Lakehouse allowed us to finally ditch those outdated files and streamline our workflows, both reducing our technical debt and making our data processes more efficient.
4. Seeing Teams Build on the AI Lakehouse
We've begun to see other teams build new data products on top of the AI Lakehouse. For example, a different team is working on an upgrade to our benchmarking tool, and they’ve been using the AI Lakehouse as their foundation. We’ve been collaborating with them on feature requests, and it’s clear that the AI Lakehouse is becoming the backbone for new innovations.
5. Positive Feedback and Increased Usage
We’ve also been collecting feedback from different stakeholders and paying attention to how teams are using the AI Lakehouse. The feedback has been overwhelmingly positive, with people noting how much easier it is to access and use data now. And the usage analysis confirms that more and more teams are turning to the AI Lakehouse for their data needs.
Lessons Learned
Avoid Premature Optimization
One of the biggest takeaways from this project was the importance of not prematurely optimizing solutions. When we were architecting our BigQuery cache layer, the simplest and most efficient solution was right in front of us, yet we were too focused on finding the "perfect" solution from the start. Had we begun with the external table approach and iterated on it, we might have saved significant development time and countless brainstorming sessions.
Sometimes, the most straightforward solution is the best starting point. Optimization can come later, as you better understand the problem and the system's requirements.
Focus on the UI
Another critical lesson was the undeniable importance of user interfaces (UIs). As one of our contributors pointed out, “Building a data product and talking about it is one thing, but seeing it is a completely different thing.” The impact of a well-designed dashboard cannot be overstated.
Often, the work of backend or system engineers remains in the background unless it’s brought to life through a compelling UI. For data teams or those working closely with data, the dashboard is that interface—it’s what enables others to see and interact with the data that backend/system engineers work hard to manage. In our case, the Looker dashboard was a prime example of how a simple, visually appealing UI can elegantly communicate the value of backend work across an organization.
Shift Responsibility
By moving the initial burden of dataset ingestion to the requestors, we’ve empowered teams to take ownership of their data, aligning with our broader goal of decentralizing data management.