Over the past 30 days, Microsoft had two downtime incidents involving its cloud computing service. During the first, on March 15, Azure experienced errors for more than three hours, while the second, on April 1, lasted almost two hours. Its two main competitors – AWS and GCP – also had similar incidents over the past few months.
So how can a company with mission-critical services on Microsoft Azure (or any other cloud service) avoid the serious impact of such events?
I have three words for you – Stability, Redundancy and Insurance – all of which will be discussed in this blog post.
You should immediately act on at least one of these, but ideally you should combine all three – purchase parametric insurance that covers this type of event while also changing the cloud settings to ensure your Azure service is both stable and redundant.
The Direct Impact of Cloud Downtime Incidents
The Parametrix Monitoring System indicated that Microsoft Azure experienced downtime due to DNS server errors (April 1) and authentication errors (March 15). Completely different reasons, but the same result for their customers – unexplained downtime with no immediate indication regarding when the services will be back up.
What happens then?
- You lose your connection to your customers, who end up blaming you for the downtime because they don't know the real reason, so your reputation suffers (not to mention your marketing team).
- Your employees rush to deal with unexpected technical and service issues, instead of checking items off their to-do list, so productivity suffers (not to mention your IT/R&D teams).
- You have to compensate your customers for the downtime based on your SLAs and you rack up revenue losses, so your balance sheet suffers (not to mention your finance team).
Top Three Ways to Minimize These Impacts
- Select Stable Regions
Downtime events differ in their geographical impact. The April 1 downtime event only affected Azure’s East US and Central US regions. Similarly, AWS has had more incidents in its US-East-1 region over the past few years than in any other region.
There are several possible reasons for this – US-East-1 is the default AWS region, it has the largest number of customers, it has more availability zones than others, and it rolls out the newest AWS versions first.
Regardless of the reason, moving to a different region can help you mitigate the risk of downtime. At the same time, moving may not be an easy decision to make, because of your specific infrastructure or because of potential cost increases.
- Build for Redundancy
Following third-party outage events, many companies contemplate adding redundant data storage and cloud products in order to improve their chances of maintaining business continuity. A multi-cloud or multi-region strategy does offer protection against cloud downtime, but it can turn out to be prohibitively expensive.
Duplicating your infrastructure and running workloads across two different regions or cloud providers could double or even triple your cloud service costs, so it may not be financially viable. Before you choose this option, you need to carefully weigh the added service costs against your potential losses from downtime.
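To make that comparison concrete, here is a minimal back-of-the-envelope sketch in Python. Every figure in it (annual cloud spend, cost multiplier, outage hours per year, loss per hour) is a hypothetical placeholder rather than data from any provider, and the function name is ours; plug in your own numbers.

```python
def redundancy_tradeoff(annual_cloud_cost, cost_multiplier,
                        outage_hours_per_year, loss_per_outage_hour,
                        fraction_of_loss_avoided=1.0):
    """Compare the extra yearly cost of redundancy with the downtime loss it avoids.

    All inputs are assumptions supplied by the caller:
    - cost_multiplier: 2.0 means redundancy doubles your cloud bill.
    - fraction_of_loss_avoided: share of downtime loss redundancy would prevent.
    """
    # Extra spend is the part of the bill above what you already pay.
    extra_cost = annual_cloud_cost * (cost_multiplier - 1.0)
    # Loss you would expect to avoid each year by being redundant.
    avoided_loss = (outage_hours_per_year * loss_per_outage_hour
                    * fraction_of_loss_avoided)
    return {
        "extra_cost": extra_cost,
        "avoided_loss": avoided_loss,
        "worthwhile": avoided_loss > extra_cost,
    }


# Hypothetical example: a 120,000/year cloud bill that would double,
# 5 outage hours per year, and 10,000 lost per outage hour.
result = redundancy_tradeoff(120_000, 2.0, 5, 10_000)
print(result)
```

With these illustrative inputs, the extra cost exceeds the avoided loss, so full duplication would not pay for itself; a higher hourly loss or a cheaper partial-redundancy setup could flip the result.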
- Purchase Insurance Coverage
The final option doesn’t stop outages, but it does transfer your downtime risk at an affordable price. Downtime insurance policies offer companies with mission-critical services in the cloud flexible terms that are tailored to their needs in terms of timing and coverage.
These policies, which are based on parametric models, can help protect your company’s financial stability. They pay out soon after the event without requiring a long claims process, allowing you to cover your financial responsibilities, including SLA liabilities and customer compensation.
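In general, a parametric policy pays a pre-agreed amount when a measurable trigger is met, instead of assessing the actual loss after the fact. The Python sketch below illustrates that idea only. The waiting period, hourly payout and limit are hypothetical terms invented for the example, not the terms of any real policy.

```python
def parametric_payout(downtime_hours, waiting_hours, payout_per_hour, limit):
    """Payout for one outage under a hypothetical parametric policy.

    - waiting_hours: downtime that must elapse before coverage starts (assumed term).
    - payout_per_hour: pre-agreed amount per covered hour (assumed term).
    - limit: maximum payout per event (assumed term).
    """
    # Only downtime beyond the waiting period counts toward the payout.
    covered_hours = max(0.0, downtime_hours - waiting_hours)
    # The payout is capped at the policy limit.
    return min(limit, covered_hours * payout_per_hour)


# Hypothetical example: a 3-hour outage, 1-hour waiting period,
# 5,000 per covered hour, 50,000 limit per event.
print(parametric_payout(3.0, 1.0, 5_000, 50_000))
```

Because the payout depends only on the measured downtime and the agreed terms, it can be calculated as soon as the outage duration is known.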