Learn how the Parametrix monitoring team developed the right tools to effectively track and monitor cloud downtime events.
It started with a muffled alarm during what we thought would be just another day. It indicated that an AWS service in the US-EAST-1 region was hiccuping. But it soon became clear that it would be a long night for the Parametrix cloud monitoring team.
In this post we discuss our technology, methodology, the teams in charge of developing our cloud monitoring system, and how they all joined forces on the day of the big AWS cloud downtime event.
Parametrix’ technology was developed with a clear mission: to build a set of sensitive and far-reaching systems that gauge SaaS, PaaS and IaaS robustness in real time and alert when they are down.
We gathered a team of cloud experts to build a complex web of monitors that find anomalies in the way the public cloud functions. We sought, and found, multi-cloud infrastructure specialists with an affinity for networking protocols and a deep understanding of the cloud’s architecture and business use cases, who could digest massive amounts of data: billions of actions per second.
Reliance on the public cloud is constantly growing, and the network becomes more complex by the quarter, with hundreds of different services in dozens of data centers across the globe. Keeping track of everything is almost mission impossible. Almost.
The Parametrix cloud monitoring team put in two years of research and development, and ended up with a real-time, multi-cloud global monitoring operation. The system executes over 100K actions a minute, utilizing networking protocols and cloud APIs to prod all services at short intervals across all zones globally.
The network monitoring portion of the system analyzes how services on different cloud zones communicate with each other, and how they communicate with applications outside of the cloud providers’ networks. The result is a tool that can find any small deviation from normal cloud performance and availability, on any service, in any region or zone.
Back to December 7.
We received the first alert at 15:18 UTC, 10:18 EST indicating a problem on US-EAST-1 - AWS’ cardinal data center. We know from experience that the region is less stable than others, and that it hosts many customers. The alarm was triggered by a high error rate for the EC2 API - a service that allots compute power to customers to run their processes at scale.
Shortly after the first alert, AutoScaling triggered another alert, indicating the service that adjusts access capacity to compute services could not scale up or down. Many other services followed suit.
The 100K actions a minute executed by our monitoring system prod every service across every zone and network. We trigger an alert when we get bad responses, don’t get responses, or the responses are slower than usual to arrive. A critical mass of alerts is a sign that something on the cloud is severely wrong.
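The alerting rule described above can be sketched roughly as follows. This is a minimal illustration, not Parametrix's actual implementation: the `ProbeResult` type, the function names, and every threshold (`slow_factor`, `min_failure_ratio`, `min_services`) are assumptions made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """One monitoring action against a service in a region (hypothetical shape)."""
    service: str
    region: str
    status: Optional[int]        # HTTP-like status code; None means no response
    latency_ms: Optional[float]  # None when no response arrived


def classify(result, baseline_ms, slow_factor=3.0):
    """Map a probe result onto the three failure modes named in the text."""
    if result.status is None:
        return "no_response"
    if result.status >= 500:
        return "bad_response"
    # "Slower than usual" is assumed to mean slow_factor times the baseline.
    if result.latency_ms is not None and result.latency_ms > slow_factor * baseline_ms:
        return "slow_response"
    return "ok"


def failing_services(results, baselines, min_failure_ratio=0.5):
    """Return (service, region) pairs whose share of failed probes is too high."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for r in results:
        key = (r.service, r.region)
        totals[key] += 1
        if classify(r, baselines[key]) != "ok":
            failures[key] += 1
    return sorted(k for k in totals if failures[k] / totals[k] >= min_failure_ratio)


def is_critical_mass(failing, region, min_services=2):
    """A region is in trouble when enough distinct services are failing in it."""
    return sum(1 for _, reg in failing if reg == region) >= min_services
```

In this sketch a single failing service raises an alert of its own, while `is_critical_mass` captures the idea that many services alerting in the same region point to a wider event.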
Now our teams were in full “war room” mode, analyzing the scope of the outage and finding answers to some important questions. Was the event limited to AWS? Was it limited to the single US-EAST-1 region? Were there other services that were interrupted, but below our own alarm threshold?
Four minutes after the first alert, Downdetector, a popular platform where users register downtime complaints, showed a sharp increase in reports of interrupted service on AWS. This confirmed our initial finding.
There was not yet a status page update from AWS. But over the next hour several companies relying on AWS put up their own status pages, acknowledging a service interruption.
AWS updated their status page after an hour and a half, longer than it usually takes them to disclose an interruption and provide context.
The root cause of the cloud downtime event was an impairment of several network devices in the US-EAST-1 region. The set of services that were interrupted rely heavily on networking, and services experienced intermittent availability or latency issues.
Early on, we understood that the problem didn’t emanate from a specific geographical territory. It impacted anyone trying to access these services on US-EAST-1, regardless of whether the request originated in the US, Europe, Asia or Africa.
Building a first-of-its-kind system is an adventure. You can easily lose focus and find yourself far adrift from your original plan. And there’s also the challenge of maintaining cost efficiency: conducting so many actions over a short time can be expensive.
But Parametrix came up with creative ways to simulate real cloud use cases, and to strike the right balance: conducting the right queries, at the right time intervals, for every cloud service and network.
“It was a challenge to build a cloud-hosted system that won’t crash if the cloud fails,” Maayan Rabi, a member of our team, summarized. “We had to learn to monitor the cloud from within, and to establish and define the narrow gap between a functioning and a degraded service. And even after we managed that, we are continuously improving the system’s resilience and efficiency.”
Downtime events on the public cloud happen frequently. What set this event apart was the number of interrupted services, and the time it took AWS to fix them.
Problems on US-EAST-1 are nothing new. It is the region with the most downtime in AWS’ entire network. And to compound the problem, many companies rely on this region for their cloud services. It’s a double whammy in many ways. For years, it was the closest data center for US businesses east of the Mississippi. That was the case until AWS built US-EAST-2 in Ohio - a new default region. With time, we expect to see some of the importance of US-EAST-1 shift over to US-EAST-2.
Still, today it’s the busiest AWS cloud region. One can only speculate whether this contributes to its higher-than-average rate of downtime incidents.
Choosing a provider and a region is a task for experts, who know what interface and services best support business goals.
But there are a few guiding lights even a layman should look for when choosing a data center:
Multi-regional architecture with automatic redirection of traffic to alternative regions is useful as part of any contingency plan. But it adds to your costs. Consider whether the services you provide mandate such an effective but costly measure.
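As a rough illustration of automatic redirection, the sketch below picks the first healthy region from an ordered preference list. The function name and the shape of the health map are assumptions for this example. In practice this decision is usually delegated to DNS failover or load-balancer health checks, not application code.

```python
def pick_region(preferred, health):
    """Return the first region in preference order that is marked healthy.

    `preferred` is an ordered list of region names; `health` maps a region
    name to True (healthy) or False (degraded). Both are hypothetical inputs.
    Regions missing from `health` are treated as unhealthy.
    """
    for region in preferred:
        if health.get(region, False):
            return region
    return None  # no healthy region available


# Example: traffic moves to US-EAST-2 while US-EAST-1 is degraded.
target = pick_region(
    ["us-east-1", "us-east-2"],
    {"us-east-1": False, "us-east-2": True},
)
```

Each extra region in the preference list is another deployment to pay for and keep in sync, which is the cost trade-off noted above.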
If you have any questions about this post, or anything else, feel free to reach out to us at firstname.lastname@example.org