AWS: Here’s what went wrong in our big cloud-computing outage

Written by

Liam Tung, Contributor

Liam Tung is an Australian business technology journalist living a few too many Swedish miles north of Stockholm for his liking. He gained a bachelor’s degree in economics and arts (cultural studies) at Sydney’s Macquarie University, but hacked (without Norse or malicious code for that matter) his way into a career as an enterprise tech, security and telecommunications journalist with ZDNet Australia.

on December 13, 2021

Topic: Enterprise Software

Amazon Web Services (AWS) rarely goes down unexpectedly, but you can expect a detailed explainer when a major outage does happen. 

The latest of AWS’s major outages began at 7:30AM PST on Tuesday, December 7, lasted five hours and affected customers using certain application interfaces in the US-EAST-1 Region, including major customers such as Disney+, Venmo and Robinhood. In a public cloud of AWS’s scale, a five-hour outage is a major incident.

AWS control planes are used to create and manage AWS resources. These control planes were affected because they are hosted on AWS’s internal network, which suffered the congestion behind the outage. So, while running EC2 instances were not affected, the EC2 APIs that customers use to launch new EC2 instances were. Higher latency and error rates were the first impacts customers saw at 7:30AM PST.
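As a rough illustration of why running instances kept serving traffic while launch requests failed, here is a minimal Python sketch of the split between a control plane and the resources it manages. The class and method names (`ControlPlane`, `Instance`, `launch_instance`, `handle_request`) are hypothetical and are not AWS APIs; this is a toy model under those assumptions, not a description of how EC2 is actually built.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class ControlPlaneUnavailable(Exception):
    """Raised when a management (control-plane) call cannot be served."""


@dataclass
class Instance:
    """Hypothetical stand-in for an already-running compute instance."""

    instance_id: str

    def handle_request(self, payload: str) -> str:
        # Serving traffic never consults the control plane.
        return f"{self.instance_id} served {payload}"


@dataclass
class ControlPlane:
    """Hypothetical stand-in for the API used to create new resources."""

    healthy: bool = True
    instances: dict[str, Instance] = field(default_factory=dict)

    def launch_instance(self) -> Instance:
        if not self.healthy:
            raise ControlPlaneUnavailable("launch API impaired")
        instance = Instance(f"i-{len(self.instances):04d}")
        self.instances[instance.instance_id] = instance
        return instance


control_plane = ControlPlane()
running = control_plane.launch_instance()

# Simulate the impairment: only the management API is affected.
control_plane.healthy = False

print(running.handle_request("GET /"))
try:
    control_plane.launch_instance()
except ControlPlaneUnavailable as exc:
    print(f"launch failed: {exc}")
```

In this sketch the existing instance keeps answering requests after the control plane is marked unhealthy, while any attempt to create a new one raises an error, which mirrors the pattern the article describes.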

With the EC2 launch APIs impaired, customers had trouble with Amazon RDS (Relational Database Service) and the Amazon EMR big data platform, while customers of the Amazon WorkSpaces managed desktop virtualization service couldn’t create new resources.

Similarly, AWS’s Elastic Load Balancers (ELB) were not directly affected but, since the ELB APIs were, customers couldn’t add new instances to existing ELBs as quickly as usual.

Route 53 (DNS) APIs were also impaired for five hours, preventing customers from changing DNS entries. There were also login failures to the AWS Console, latency affecting the AWS Security Token Service (STS) used by third-party identity services, delays to CloudWatch, impaired access to Amazon S3 buckets and DynamoDB tables via VPC Endpoints, and problems invoking serverless Lambda functions.

The December 7 incident shared at least one trait with a major outage that occurred this time last year: it stopped AWS from communicating swiftly with customers about the incident via the AWS Service Health Dashboard. 

“The impairment to our monitoring systems delayed our understanding of this event, and the networking congestion impaired our Service Health Dashboard tooling from appropriately failing over to our standby region,” AWS explained. 

Additionally, the AWS support contact center relies on the AWS internal network, so staff couldn’t create new cases at normal speed during the five-hour disruption.

AWS says it will release a new version of its Service Health Dashboard in early 2022, which will run across multiple regions to “ensure we do not have delays in communicating with customers.”

Cloud outages do happen. Google Cloud has had its fair share, and Microsoft in October had to explain its eight-hour outage. While rare, the outages are a reminder that the public cloud might be more reliable than conventional data centers, but things do go wrong, sometimes catastrophically, and can impact a wide range of critical services.

“Finally, we want to apologize for the impact this event caused for our customers,” said AWS. “While we are proud of our track record of availability, we know how critical our services are to our customers, their applications and end users, and their businesses. We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.”
