Amazon Web Services' (AWS) largest US cloud region (US-EAST-1) experienced an outage lasting 2 hours and 48 minutes, bringing enterprise giants down with it
On June 13, 2023, at 18:49 UTC, 13:49 EST, the leading cloud service provider, Amazon Web Services (AWS), experienced a significant downtime incident in the US-EAST-1 region.
According to the Parametrix Cloud Monitoring System (PCMS), the main impacted service was Lambda. Its failure caused direct errors for customers using Lambda and for dependent services such as API Gateway.
AWS has since traced the root cause of the outage to a subsystem responsible for capacity management in AWS Lambda. The figure below shows that from 18:49 UTC to 21:37 UTC, 13:49 - 16:37 EST, Lambda experienced a sudden drop in success rate and was unavailable for 2 hours and 48 minutes.
It wasn't until 20:48 UTC, 15:48 EST, that AWS officially acknowledged the incident, confirming that the event had indeed begun at 18:49 UTC, 13:49 EST. Here are some notable highlights:
AWS Lambda
As the primary service affected by the issue, AWS Lambda experienced an extended outage. Although the monitoring system detected slight improvements at 20:42 UTC, 15:42 EST, full recovery was not achieved until 21:37 UTC, 16:37 EST, which suggests the underlying issue was complex.
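The timestamps reported above can be cross-checked with a short calculation. The sketch below is illustrative only: the variable and function names are assumptions, and the only inputs are the UTC times quoted in this report.

```python
from datetime import datetime, timedelta

DAY = "2023-06-13"  # date of the incident, as reported above


def utc(hhmm: str) -> datetime:
    """Parse an HH:MM time on the incident day (all times treated as UTC)."""
    return datetime.strptime(f"{DAY} {hhmm}", "%Y-%m-%d %H:%M")


def fmt(delta: timedelta) -> str:
    """Format a timedelta as 'Xh YYm'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h {minutes % 60:02d}m"


# Timeline taken from the report (UTC).
start = utc("18:49")                # Lambda success rate drops
partial_improvement = utc("20:42")  # first slight improvements detected
acknowledged = utc("20:48")         # AWS officially acknowledges the incident
recovered = utc("21:37")            # full recovery

outage_duration = fmt(recovered - start)
time_to_acknowledge = fmt(acknowledged - start)
time_to_improvement = fmt(partial_improvement - start)

print("Outage duration:        ", outage_duration)      # 2h 48m
print("Time to acknowledgement:", time_to_acknowledge)  # 1h 59m
print("Time to first recovery: ", time_to_improvement)  # 1h 53m
```

By these figures, the official acknowledgement came roughly two hours into an outage that lasted just under three.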
AWS API Gateway
From 18:49 UTC to 20:42 UTC, 13:49 - 15:42 EST, the monitoring system observed a sharp increase in client-side errors, indicating that API Gateway in the US-EAST-1 region had become unresponsive. This disruption caused significant inconvenience for users relying on this critical service.
Other AWS Services
While AWS Lambda bore the brunt of the outage, the monitoring system also detected interruptions in multiple other AWS services. Customers experienced authentication or sign-in errors when using the AWS Management Console or authenticating through Cognito or IAM STS, as well as issues when attempting to initiate a call or chat with AWS Support.
Thousands of end-users in the US flooded the web with reports and complaints of interrupted service.
As dependency on the cloud in general, and on this region in particular, continues to rise, businesses must address the risk of cloud outages by looking into redundancy measures and disaster recovery strategies that minimize downtime and ensure reliable service delivery.
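As one illustration of a redundancy measure, the sketch below tries a request against an ordered list of regions and falls back to the next one when a region fails. It is a minimal sketch under stated assumptions, not a recommended architecture: `call_with_failover`, `AllRegionsFailed`, and `fake_invoke` are hypothetical names, the outage is simulated, and no real AWS API is called.

```python
from typing import Callable, Dict, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when every region in the list failed (hypothetical helper)."""


def call_with_failover(
    regions: Sequence[str],
    invoke: Callable[[str], str],
) -> Tuple[str, str]:
    """Try each region in order; return (region, response) for the first success."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return region, invoke(region)
        except Exception as exc:  # a real client would catch narrower error types
            errors[region] = exc
    raise AllRegionsFailed(errors)


def fake_invoke(region: str) -> str:
    """Stand-in for a real service call; simulates an outage in us-east-1."""
    if region == "us-east-1":
        raise ConnectionError("simulated regional outage")
    return f"ok from {region}"


served_by, response = call_with_failover(["us-east-1", "us-west-2"], fake_invoke)
print(served_by, response)
```

In practice, a fallback like this only helps if the workload and its data are already deployed in the secondary region, which is a disaster recovery planning decision rather than a code change.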
For more information on the persistence and impact of cloud outages on businesses today, view Parametrix's latest report on cloud downtime: Managing Cloud Outage Risk.