How to avoid being affected by the next AWS outage

The AWS outage of November 2020 disrupted services from companies such as Adobe Spark, Roku and iRobot. Here are some strategies you can implement to avoid the disastrous effects of third-party IT downtime and manage your risk when business is interrupted.

The AWS outage that started in the early hours of November 25 was significant for companies that host their systems in Amazon’s Northern Virginia US-East-1 region and use its Kinesis service or other AWS services reliant on Kinesis. The impact was felt by companies including Adobe Spark, Roku and iRobot, all of which confirmed that some of their services, apps, and websites were experiencing issues. AWS users in other regions, and those not using Kinesis or services that depend on it, were not affected.

An explanation from Amazon stated that the outage was due to a “relatively small addition of capacity” to its front-end fleet which “caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration.” This snowballed quickly and impacted several other major Amazon services in the US-East-1 region, including Amazon Cognito, CloudWatch and Lambda serverless computing infrastructure.

As part of its statement, Amazon apologized and acknowledged “how critical this service, and the other AWS services that were impacted, are to our customers, their applications and end-users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.” This event underscores how dependent many companies are on third-party technology (tech that is entirely out of their control) and how much they are at the mercy of Amazon, Microsoft, Google and other cloud service providers.

With this in mind, I’d like to share some strategies companies can implement to avoid the disastrous effects of third-party IT downtime and manage their risks when business is interrupted.

Cloud Strategy 1: Build Redundancy

In the wake of the recent outage, many companies are discussing whether they should invest in redundant data storage and cloud products. Companies that already run redundant infrastructure outside the affected region would likely have avoided downtime. For those that want to ensure business continuity during the next outage, a multi-cloud or multi-region strategy offers protection, but the cost is prohibitive for some companies.

Duplicating your infrastructure and running workloads across two different regions or cloud providers could amount to doubling or even tripling your costs, and many businesses do not have the finances to put this strategy into practice.

In the end, if you choose this strategy, you should have a deep understanding of the risks you face and of whether the protection justifies the investment. To estimate the costs involved, use the AWS Pricing Calculator.
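To make the trade-off above concrete, here is a rough back-of-the-envelope sketch in Python. Every figure in it (monthly spend, overhead rate, outage hours, loss per hour) is a hypothetical placeholder, and the function names are illustrative only; substitute your own estimates before drawing any conclusion.

```python
def extra_annual_cost(base_monthly: float, deployments: int,
                      overhead_rate: float = 0.0) -> float:
    """Extra yearly spend from running `deployments` full copies of a workload.

    overhead_rate is an assumed fraction added for cross-region data
    transfer, tooling and operations (hypothetical, not an AWS figure).
    """
    if deployments < 1:
        raise ValueError("deployments must be at least 1")
    total_monthly = base_monthly * deployments * (1 + overhead_rate)
    return (total_monthly - base_monthly) * 12


def expected_annual_outage_loss(outage_hours_per_year: float,
                                loss_per_hour: float) -> float:
    """Expected yearly loss if outages are not mitigated."""
    return outage_hours_per_year * loss_per_hour


def redundancy_pays_off(extra_cost: float, loss_avoided: float) -> bool:
    """True when the loss avoided is at least the extra spend."""
    return loss_avoided >= extra_cost


# Hypothetical inputs: $10,000/month today, two deployments, 25% overhead,
# 8 hours of provider downtime per year at $15,000 of loss per hour.
cost = extra_annual_cost(10_000, deployments=2, overhead_rate=0.25)
loss = expected_annual_outage_loss(8, 15_000)
print(f"Extra cost per year: ${cost:,.0f}")
print(f"Expected outage loss per year: ${loss:,.0f}")
print("Redundancy pays off:", redundancy_pays_off(cost, loss))
```

With these made-up numbers the extra spend exceeds the expected loss, which is the situation in which a company might look at the other strategies below instead.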

Cloud Strategy 2: Change Regions

AWS is no stranger to outages and downtime incidents, but its US-East-1 region suffers more than any other region. Over the last few years, we’ve identified more downtime incidents in this region than in others. The reasons could stem from the fact that it is the default region, has more availability zones than most others, holds the largest number of customers and rolls out the newest versions first.

In March 2018, during a two-hour stretch, the region suffered two separate power outages that affected some 240 critical services, and enterprises such as Slack, Twilio and Atlassian all reported issues. It turned out that companies in the region that used AWS’s Direct Connect service were all affected.

A year earlier, AWS suffered an outage that affected its Simple Storage Service (S3) in the same region. The 2017 event was attributed to human error. While debugging an issue, a technician incorrectly entered a command, prompting an unplanned restart. The event lasted for more than 4 hours and affected Expedia, Medium, Slack and the U.S. Securities and Exchange Commission. Additionally, Apica found that 54 of the top 100 e-commerce sites experienced performance declines of 20% or more due to the event.

If your systems are hosted in the US-East-1 region, changing regions could be a viable strategy to help mitigate your risk. However, moving your entire system to a new region can be expensive if the infrastructure wasn’t designed to accommodate changes of this sort.

Cloud Strategy 3: Transfer Risk with Insurance Products

An insurance policy like the one offered by Parametrix Insurance can provide you with a risk transfer option that even most SMEs can afford. Our policies are designed to meet the needs of today’s technology-dependent businesses by offering flexible coverage through a parametric model. Each company can determine the policy’s thresholds, and when those are met, the policy is automatically triggered.

While our policies cannot stop an outage, they can help you maintain continuity and financial stability. When determining your policy parameters, you can consider responsibilities such as SLA liabilities, discretionary payments to compensate customers, the costs of repairing reputational damage, and more. With pre-determined parameters and payout amounts, we can quickly provide compensation that enables businesses to regain continuity as fast as possible.
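For readers unfamiliar with the parametric model, the idea of a pre-agreed threshold that triggers a pre-agreed payout can be sketched in a few lines of Python. This is only an illustration of the general concept: the tier structure, thresholds and amounts are hypothetical and do not represent actual policy terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """One hypothetical trigger: outage duration threshold and fixed payout."""
    min_outage_minutes: int
    payout: float


def payout_for(tiers: list[Tier], outage_minutes: int) -> float:
    """Return the payout of the highest tier whose threshold was met.

    No loss assessment takes place: the measured outage duration alone
    decides the amount, which is what makes the model parametric.
    """
    eligible = [t for t in tiers if outage_minutes >= t.min_outage_minutes]
    if not eligible:
        return 0.0
    return max(eligible, key=lambda t: t.min_outage_minutes).payout


# Hypothetical policy: one hour of downtime triggers $25,000,
# four hours triggers $100,000.
policy = [Tier(60, 25_000.0), Tier(240, 100_000.0)]

for minutes in (30, 90, 300):
    print(f"{minutes} min outage -> payout ${payout_for(policy, minutes):,.0f}")
```

Because the payout depends only on a measured parameter, it can be determined as soon as the outage data is available.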

Preparing For The Next Outage

When the cloud goes down, it is painful and detrimental to business. Cloud providers don’t guarantee 100% uptime, so when availability drops, as it does on occasion throughout the year, you will get credits back for the services you didn’t receive. You won’t be compensated for the recovery expenses you incur, such as payouts under the SLAs you have with your own clients, the cost of repairing reputational damage, or the additional support hours your teams need to log to handle upset customers.
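The gap between a service credit and the real cost of an outage can be illustrated with a small Python sketch. The credit schedule, bill and cost figures below are hypothetical; actual SLA terms differ by provider and by service, so check the SLA that applies to you.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200

# Hypothetical schedule, lowest availability first:
# (monthly availability below this %, credit as % of the monthly bill)
CREDIT_SCHEDULE = [(95.0, 100), (99.0, 25), (99.99, 10)]


def availability_percent(downtime_minutes: float,
                         total_minutes: int = MINUTES_PER_30_DAY_MONTH) -> float:
    """Monthly availability as a percentage."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def service_credit(monthly_bill: float, downtime_minutes: float) -> float:
    """Credit owed under the hypothetical schedule above."""
    availability = availability_percent(downtime_minutes)
    for below, credit_percent in CREDIT_SCHEDULE:
        if availability < below:
            return monthly_bill * credit_percent / 100
    return 0.0


# Hypothetical month: $10,000 bill, one 8-hour outage.
credit = service_credit(10_000, downtime_minutes=8 * 60)

# Hypothetical costs the provider does not reimburse.
own_costs = {
    "SLA payouts to our clients": 40_000,
    "extra support hours": 5_000,
}
uncovered = sum(own_costs.values()) - credit

print(f"Service credit: ${credit:,.0f}")
print(f"Costs left uncovered: ${uncovered:,.0f}")
```

In this made-up scenario the credit covers only a small fraction of the total cost, which is the exposure the strategies in this article aim to reduce.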

As we saw with this last AWS outage, when a company is affected, there is a ripple effect: first to its systems and availability, then to its employees and their productivity, and then to its customers.

My advice is to plan for the future and do it now. Another event will happen – it’s a matter of time – and you don’t want to be caught without any defenses.

If you are unsure about how to proceed, reach out to us. We believe that companies that know more can make smarter choices, and sharing our knowledge is part of our company DNA.

Neta Rozy
Neta has a rich background developing enterprise software and robust monitoring systems. She co-founded Parametrix and built a team that is pioneering the development of a unique, global downtime event monitoring system to track SaaS, PaaS and IaaS system outages, network crashes, and platform failures down to the millisecond.
Published December 7, 2020