Various root causes lead to disruptions of cloud services, and to critical events that take them down. These disruptions stem from software, hardware, and infrastructure issues, despite cloud providers’ heavy investment in data-center resilience, which is a top priority for them.
For example, reported root causes for events suffered by AWS include power outages and bugs triggered by traffic overloads. Azure and GCP have reported causes including issues with physical infrastructure, insufficient resources due to increased traffic, the recent rollout of new services, and the misconfiguration of systems during routine maintenance. The most commonly reported root cause of outage events in 2022 was human error, which includes misconfiguration and faulty maintenance activity.
Root causes of critical outage events in 2022
Power: Loss of power is a common physical cause of outages. Large data centers consume about the same amount of energy as 80,000 homes. Damage to the grid or to generation plants impacts everything that relies on them, including data centers. Like many businesses, data centers typically have backup generators able to meet at least part of their electricity demand in case of a power failure.
Overload: System overload occurs when demand for the services provided by an individual data center exceeds its capacity to supply services. Overloads are typically caused by sudden spikes in traffic, by infrastructure downtime, or by a malicious DDoS attack. When a data center has reached its capacity, users’ systems may display error codes, delay requests, or deliver only partial content.
Physical Infrastructure: Damage to the physical infrastructure of the digital supply chain could result from direct damage to a data center (perhaps from a natural disaster or weather-related event), or from damage to the region’s power grid. Since such events are local, it is nearly impossible for physical infrastructure disruptions to cripple multiple global regions at once. However, a major event like an earthquake could impact an entire region.
Connectivity: Connectivity issues such as internet interference constitute the most ambiguous reason for cloud downtime. Such failures are usually beyond the control of the cloud provider, and therefore difficult to forecast, especially as the largest cloud vendors balance workloads across geographically separated data centers.
Human error: People are the most common reason why the cloud goes down. Human error is the only root cause that falls into both the physical and the software categories of causes. It could be that a developer entered the wrong configuration or an incorrect command, or that an operator accidentally turned off the air conditioning. People can make a host of mistakes that bring down an entire IT service - and they often do.
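To make the overload cause described above more concrete, here is a minimal Python sketch of what happens when demand exceeds a data center's capacity. It is an illustration only: the function name `simulate_overload`, the requests-per-second figures, and the simplification that every request beyond capacity is rejected with an error are assumptions for this example, not a description of how any cloud provider actually handles load.

```python
def simulate_overload(capacity_rps, demand_rps):
    """Compare per-second demand against a fixed capacity.

    capacity_rps: hypothetical maximum requests per second the site can serve.
    demand_rps:   list of hypothetical demand values, one per second.
    Returns one dict per second with the served and rejected request counts.
    """
    results = []
    for second, demand in enumerate(demand_rps):
        served = min(demand, capacity_rps)  # the site cannot serve more than its capacity
        rejected = demand - served          # the excess would see errors or delays
        results.append({
            "second": second,
            "served": served,
            "rejected": rejected,
            "overloaded": rejected > 0,
        })
    return results


if __name__ == "__main__":
    # A sudden traffic spike in the third second pushes demand past capacity.
    for row in simulate_overload(1000, [800, 1000, 1500]):
        print(row)
```

In this toy run, the first two seconds are served in full, while the spike to 1,500 requests leaves 500 requests unserved. In a real system those requests might instead be delayed or answered with partial content.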
Understanding the risk of cloud downtime is the first step in making sure it doesn’t bring your business to a costly halt. This post is the seventh and final in a series about managing cloud outage risks in the Digital Supply Chain. You can read more about it in the Parametrix report revealing the details of cloud downtime among the three major providers – Amazon Web Services, Google Cloud, and Microsoft Azure.