Summary
On Tuesday, August 31st, AWS had an outage in their us-west-2 region. At 18:00 UTC that day, we experienced an increase in 5xx error codes returned by our API, as well as a slowdown in transcription turnaround time. The AWS outage impacted a single AWS availability zone, usw2-az2, which maps to us-west-2a in our AWS account (the mapping may be different for yours). We would like to take this opportunity to share our post-mortem of this event.
We hope AWS provides a Post-Event Summary soon.
Impact
API 5xx Responses
We saw a significant increase in the number of 5xx responses returned to users making API calls.
For reference, normally about 0.01-0.02% of responses from our API are 5xx status codes. During the height of this incident, 3% of responses from our API were 5xx status codes.
Most of these 5xx responses originated from the AWS Application Load Balancer in front of our API, not our application code. For more information on the metrics in the image below, check out the AWS docs. In hindsight, we should have disabled the us-west-2a AZ in our ALB configuration.
In green, 5xx errors returned by the AWS Application Load Balancer. In red, 5xx errors returned by our API.
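To make that hindsight concrete: disabling an AZ on an ALB amounts to re-applying its subnet list without the impaired AZ's subnet. The sketch below uses boto3 with placeholder ARNs and subnet IDs; it is not our actual tooling.

```python
# Hedged sketch: drop the impaired AZ (us-west-2a) from an ALB by calling
# SetSubnets with only the healthy AZs' subnets. IDs and ARNs are placeholders.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-west-2")

ALB_ARN = "arn:aws:elasticloadbalancing:us-west-2:123456789012:loadbalancer/app/api/0123456789abcdef"
HEALTHY_SUBNETS = ["subnet-usw2b-placeholder", "subnet-usw2c-placeholder"]  # us-west-2a left out

# SetSubnets replaces the ALB's enabled subnets, so the load balancer stops
# placing nodes in the AZ that is left out of the list.
elbv2.set_subnets(LoadBalancerArn=ALB_ARN, Subnets=HEALTHY_SUBNETS)
```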
We determined that the 5xx errors returned by our application had two causes: an issue with the Network Load Balancer in front of PGBouncer, which we use to pool connections to our database, and packet loss within the us-west-2a availability zone. Our PGBouncer containers flapped between unhealthy and healthy, which caused the NLB to frequently stop routing traffic to PGBouncer containers in us-west-2a.
According to the information that AWS has released about the outage, the behavior we saw with PGBouncer was caused by the following:
A component within the subsystem responsible for the processing of network packets for Network Load Balancer, NAT Gateway, and PrivateLink services became impaired. It was no longer processing health checks successfully.
Transcription Slowdown
From 18:00 - 18:20 UTC, we saw a significant increase in the time it took us to process a transcript. There were several causes for this issue.
We track what we call "true crunch", which tells us how long it took us to transcribe a file as a percentage of the length of the file. For example, if it takes us 30 seconds to transcribe a 60-second file, the crunch on that file is 0.5, or 50%. True crunch also accounts for the time a given file sat waiting in the transcription queue. We aim to keep true crunch well under 30%.
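As a minimal sketch (the function and argument names are ours for illustration, not actual pipeline code), true crunch is just the total time spent on a file, queue wait included, divided by the file's duration:

```python
def true_crunch(queue_seconds: float, transcribe_seconds: float, audio_seconds: float) -> float:
    """Total handling time (queue wait + transcription) as a fraction of audio length."""
    return (queue_seconds + transcribe_seconds) / audio_seconds

# The example above: a 60-second file transcribed in 30 seconds -> 0.5 (50% crunch).
print(true_crunch(queue_seconds=0, transcribe_seconds=30, audio_seconds=60))   # 0.5
# The same file after 12 seconds in the queue -> 0.7, which is why queueing
# delays during the outage pushed true crunch well past our 30% target.
print(true_crunch(queue_seconds=12, transcribe_seconds=30, audio_seconds=60))  # 0.7
```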
At 18:04 UTC, we started to see a massive spike in true crunch. The times shown in the graph below are in Mountain Time. For reference, multiply the Y-axis of this graph by 100 to get the crunch percentage.
We also saw massive spikes in our DynamoDB query times. We use DynamoDB as a metadata-rich queue for our transcription pipeline. Usually, our query time sits around 0.5 seconds. Slow DynamoDB query times cause significant slowdowns to our transcription pipeline.
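For context, here is a hedged sketch of what that usage pattern looks like; the table name, index, and attributes below are assumptions for illustration, not our actual schema.

```python
# Illustrative only: treating a DynamoDB table as a metadata-rich queue and
# timing the query, which is the "query time" metric discussed above.
import time
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="us-west-2")
table = dynamodb.Table("transcription-queue")          # assumed table name

start = time.monotonic()
resp = table.query(
    IndexName="status-index",                          # assumed GSI keyed on job status
    KeyConditionExpression=Key("status").eq("queued"),
    Limit=25,
)
query_seconds = time.monotonic() - start               # normally ~0.5s; spiked during the outage

for job in resp["Items"]:
    # "enqueued_at" is an assumed attribute; the gap between it and "now" is
    # the "wait time" metric discussed below.
    wait_seconds = time.time() - float(job["enqueued_at"])
    print(f"job {job.get('job_id')}: waited {wait_seconds:.1f}s, query took {query_seconds:.2f}s")
```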
The time items spent queued in DynamoDB increased during this incident. We call this "Wait time" and it usually averages out around 0.5 seconds. During this incident, our average "Wait time" spiked very, very high (shown in the first image below). Our max wait time (the second image) peaked at 262 seconds. While we started to see recovery starting around 18:20 UTC, we still saw spikes in "Wait time" during the entire AWS outage. We attribute this to issues with the AWS network and AWS services - AWS' incident updates mentioned impacts to many key AWS services, including Kinesis. If our experience with AWS outages tells us anything, it's that Kinesis issues have a wide blast radius.
Similarly, some of our SQS queues started to back up. We use SQS for a handful of our transcription services, such as punctuation and categorization, which poll SQS queues for work.
Keep in mind that true crunch is a holistic metric, so a slowdown anywhere in our transcription pipeline impacts this metric. Multiple pieces of our transcription pipeline were impacted by the AWS outage. We bore the brunt of the impact in the first 20 or so minutes of the outage but continued to see intermittent slowdowns for several hours.
Transcription Errors
We saw a relatively large increase in transcript failures. We categorize transcription failures as either client_error or server_error. A client_error means we experienced an issue that prevented us from transcribing a file at all, such as an issue downloading the source audio file. A server_error means something went wrong within our transcription pipeline. The transcription failures we saw were almost entirely client_error, which we attribute to AWS network issues causing file download failures.
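As a toy illustration of that split (the helper names and exception handling here are ours, not the real pipeline):

```python
import requests

def transcribe(audio_bytes: bytes) -> str:
    # Stand-in for the real transcription pipeline.
    return "placeholder transcript"

def run_job(audio_url: str) -> dict:
    try:
        resp = requests.get(audio_url, timeout=30)
        resp.raise_for_status()
    except requests.RequestException:
        # We never got a usable source file, e.g. the download failed.
        return {"status": "error", "error_type": "client_error"}
    try:
        text = transcribe(resp.content)
    except Exception:
        # Something broke inside our own pipeline.
        return {"status": "error", "error_type": "server_error"}
    return {"status": "completed", "text": text}
```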
Detection
According to AWS, this incident started at 12:46 MDT. However, we saw issues with Kinesis around 8:48 MDT.
We became aware that something was off when we received an alarm for the queue depth of our punctuation service. At the time we received the alarm, the SQS queue had a backlog of 173 messages. That was an eye-catching number, so we decided to check Datadog. If we have a high message backlog, but messages are being picked up quickly, then everything will straighten itself out without any noticeable impact to our users. However, the "message pickup time" metric that our workers publish (which we calculate by looking at the SentTimestamp message attribute) was shockingly high. This meant that messages were sitting in the queue for far too long.
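Roughly, the metric is computed like this (the queue URL and metric plumbing are placeholders; SentTimestamp is set by SQS in epoch milliseconds when the message is accepted):

```python
import time
import boto3

sqs = boto3.client("sqs", region_name="us-west-2")
QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/punctuation-jobs"  # placeholder

resp = sqs.receive_message(
    QueueUrl=QUEUE_URL,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,
    AttributeNames=["SentTimestamp"],  # epoch ms at which SQS accepted the message
)

for message in resp.get("Messages", []):
    sent_ms = int(message["Attributes"]["SentTimestamp"])
    pickup_seconds = time.time() - sent_ms / 1000.0
    # In the real workers this value is published to Datadog as "message pickup time".
    print(f"picked up after {pickup_seconds:.1f}s")
```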
A few minutes later, our PGBouncer containers started to become unhealthy. We implemented PGBouncer in March and hadn't had any issues with LB targets becoming unhealthy until this incident.
We checked the real AWS status page (Twitter) and saw others were having issues.
Response
About 10 minutes after the first queue depth alarm, we started to take action to stabilize our platform.
The first step we took was to increase the number of workers for our backed-up queues. Below is a look at the number of running workers for our punctuation service. Normally we run between 14 and 20 workers to handle our peak traffic period. We manually scaled this service up to 50 workers to quickly work through the queue and hopefully prevent another backlog. We took similar steps for a few of our other services.
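Assuming the service runs on ECS (where most of our services live), that manual scale-up is a one-line desired-count change; the cluster and service names below are placeholders:

```python
import boto3

ecs = boto3.client("ecs", region_name="us-west-2")

# Bump the punctuation workers from the usual 14-20 up to 50 to drain the backlog.
ecs.update_service(
    cluster="transcription-pipeline",  # placeholder cluster name
    service="punctuation",             # placeholder service name
    desiredCount=50,
)
```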
With PGBouncer in a relatively unstable state, we added more PGBouncer containers. We didn't see much benefit from this, unfortunately.
Recovery
We started to see two key metrics, true crunch and message pickup time (for all queues), recover around 18:20 UTC. Given that our issues were caused by an AWS incident, we were in the frustrating position of having to wait for AWS to implement a fix.
Around 22:10 UTC, we saw 5xx responses from our load balancer drop back to normal.
Lessons Learned
There are 4 availability zones in the us-west-2 region. Our ECS cluster, where most of our services run, makes use of 3 of these AZs. We plan to make use of the fourth availability zone in the coming weeks. This will make us more resilient to issues within a single availability zone.
Two of our core services run on EC2 and are managed by a custom orchestrator. Workers for these services are launched using the RunInstances API. The orchestrator does not specify a subnet when launching instances, so AWS will "choose a default subnet from your default VPC for you". When we looked at these instances, almost all of them were in the impacted availability zone.
We already have a tech debt task to move our production workloads out of the default VPC in order to follow best practices. We are going to add randomized subnet selection into our orchestrator. Further, we plan to move these two core services onto ECS in the coming weeks. Migrating these services to ECS will give us better spread across AZs.
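A sketch of what that randomized subnet selection could look like in the orchestrator's launch path (the AMI, instance type, and subnet IDs are placeholders):

```python
import random
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# One subnet per AZ we want workers in; picking one at random spreads new
# instances across AZs instead of defaulting to a single default-VPC subnet.
WORKER_SUBNETS = [
    "subnet-usw2a-placeholder",
    "subnet-usw2b-placeholder",
    "subnet-usw2c-placeholder",
    "subnet-usw2d-placeholder",
]

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",          # placeholder worker AMI
    InstanceType="c5.2xlarge",                # placeholder instance type
    MinCount=1,
    MaxCount=1,
    SubnetId=random.choice(WORKER_SUBNETS),   # an explicit subnet pins the AZ
)
```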
Wrapping Up
When this incident started, we were happy to see our efforts towards platform resiliency pay off - many of our users didn't feel the impact of the AWS outage. That said, some mistakes on our end were a big footgun. We aim to have these mistakes rectified in the next week.
If you're interested in joining the AssemblyAI team, check out our open jobs here! https://apply.workable.com/assemblyai/