Three ways Risk Managers can protect their companies from cloud outages

May 26, 2022

On May 6, 2022, services housed in Google Cloud Platform’s us-central1 region faltered. Customers relying on the affected services experienced high latency and errors. The outage didn’t make national headlines like AWS’s December crashes or Fastly’s June CDN outage. But for anyone assessing risk, the takeaway is clear: you never know when downtime will hit your business, or how severely.

Here are some guidelines to make sure your company is prepared for outages.

Say it out loud: Downtime Happens

According to Parametrix data, in 2021 cloud services were interrupted every 10 days, on average. These include latencies, interruptions and full-blown outages – all having some impact on business. The longest outage lasted 11 hours. But what toll do outages take on businesses? Some damages are easy to measure, while others are harder to grasp (and calculate).

Direct revenue loss is probably the most obvious cost and includes lost sales while a website is down or lost commission when customers can’t transact on a platform.

Lost productivity becomes an issue when internal tools fail and your workforce loses its ability to… work. No emails, no development, no deployment, no tracking.

SLA liability is another potential cost of downtime. An outage can make it impossible to deliver service on time, leaving you in breach of your service-level agreements and your customers disappointed. It won’t matter to them that it was a cloud provider, not you, that was at fault.

Recovery expenses are hard to measure. These include the cost of luring back customers, offering incentives to make up for lost marketing opportunities, repairing a tarnished brand and upping your PR spend to mitigate the damage.
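
To make these four categories concrete, the minimal Python sketch below adds them up for a single outage. It is illustrative only: the class, function and field names are hypothetical, and every figure is a made-up input that a business would replace with its own estimates.

```python
from dataclasses import dataclass


@dataclass
class OutageInputs:
    """Hypothetical inputs; each figure is the business's own estimate."""
    hours_down: float
    revenue_per_hour: float      # direct revenue lost per hour offline
    idle_employees: int          # staff who cannot work during the outage
    hourly_wage: float           # loaded cost of one employee-hour
    sla_penalty_per_hour: float  # credits or penalties owed to customers
    recovery_spend: float        # one-off PR, incentives, brand repair


def estimate_outage_cost(o: OutageInputs) -> dict:
    """Return the cost per category plus the total, in one currency."""
    costs = {
        "direct_revenue_loss": o.hours_down * o.revenue_per_hour,
        "lost_productivity": o.hours_down * o.idle_employees * o.hourly_wage,
        "sla_liability": o.hours_down * o.sla_penalty_per_hour,
        "recovery_expenses": o.recovery_spend,
    }
    # The total is computed before it is added, so it is not double counted.
    costs["total"] = sum(costs.values())
    return costs


if __name__ == "__main__":
    # Made-up example figures, for illustration only.
    example = OutageInputs(
        hours_down=2, revenue_per_hour=10_000, idle_employees=50,
        hourly_wage=60, sla_penalty_per_hour=1_000, recovery_spend=5_000,
    )
    for category, amount in estimate_outage_cost(example).items():
        print(f"{category}: {amount:,.0f}")
```

The first three categories scale with the length of the outage. Recovery spend is entered as a single lump sum because, as noted, it is the hardest of the four to pin down.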

Don’t exacerbate, mitigate!

Any way you measure it, downtime is costly. According to Information Technology Intelligence Consulting (ITIC), 40% of enterprises put the cost of an hour of downtime at anywhere from $1 million to more than $5 million, exclusive of any legal fees, fines or penalties. In a Gartner survey, 98% of companies said the cost of IT downtime ranged from $100,000 to $540,000 per hour. So how can you mitigate the many risks associated with outages?

Plan ahead

System Down! What’s your plan? Does it identify and address every cloud service that may crash? Is it even executable … if the cloud is down?

Atlassian gave the tech world a stark reminder of the importance of contingency plans. In April 2022, several hundred customers lost access to multiple Atlassian cloud services. The outage dragged on for weeks as Atlassian tried to overcome a long list of issues it had not anticipated.

The software company’s engineers wanted to deactivate an obsolete application. But a communication gap between two teams resulted in a script that permanently deleted the affected customers’ Atlassian cloud products, along with the associated data. So just bring up a backed-up copy of the database for a quick recovery, right?

Atlassian had the backup, but quickly learned that its systems couldn’t accommodate a batch recovery of all the affected accounts. Another unpleasant surprise was the inability to communicate with affected customers. Contact information was deleted along with the services, so Atlassian could not reach some of its customers directly.

The company had to scramble, directing immense development effort to building a new system that would bypass its customer support system and to creating an environment where customer data could be restored. Planning ahead means foreseeing the damage and having a solid, tested plan to fix it. The plan also needs to account for communication channels, so that customers can be reassured they’re being taken care of and their trust and loyalty are maintained.

Introduce Redundancy

Redundancy means you’re storing your data and running compute (and other services) in more than one location. So if one location fails, you can fall back on the other.

There are various ways to achieve redundancy. In the cloud, most regions have several availability zones, so a degree of redundancy can be achieved fairly easily, often by checking a box. But doing so costs extra, and if the entire region crashes, this level of redundancy won’t help.

Some businesses enhance redundancy by storing data and running compute across regions. Others take it a step further and spread data across providers. These solutions greatly reduce the risk of damage. With cross-region redundancy, business interruption can be averted even if an entire region crashes. And cross-provider redundancy protects you even when an entire provider goes down. But there’s a price.

Redundancy is expensive. You are using twice the storage and twice the compute, and paying twice the bill, if not more.
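
The trade-off can be roughed out with simple arithmetic. The Python sketch below assumes each location is up a fixed fraction of the time and that locations fail independently of one another, which is optimistic: zones in the same region can go down together. The function names and the 99.9% figure are illustrative assumptions, not provider guarantees.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(per_location: float, locations: int) -> float:
    """Chance that at least one of several independent locations is up."""
    if not 0.0 <= per_location <= 1.0:
        raise ValueError("per_location must be between 0 and 1")
    if locations < 1:
        raise ValueError("locations must be at least 1")
    # All locations must fail at once for the service to be down.
    return 1.0 - (1.0 - per_location) ** locations


def expected_downtime_hours(availability: float) -> float:
    """Expected hours per year with no location available."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # 0.999 is an assumed per-location availability, for illustration only.
    for n in (1, 2, 3):
        a = combined_availability(0.999, n)
        print(f"{n} location(s): {expected_downtime_hours(a):.4f} hours/year "
              f"expected downtime, roughly {n}x the infrastructure bill")
```

Under these assumptions each added location shrinks expected downtime sharply while the bill grows roughly in step, which is why the decision comes down to what an hour of downtime costs your business.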

Get Insured

Downtime insurance is a new product on the market. It starts from the premise that outages happen and that businesses are always at risk, even if they do everything in their power to protect against them. Policies are designed to cushion the damage by quickly providing the cash flow needed to address all aspects of recovery.

These policies are parametric, meaning:

Damages are pre-assessed
Businesses know better than anyone what may go wrong and what may need fixing. So they set their own price for each hour of downtime. There’s no need to establish or prove damages if a policy is activated.

Triggers are transparent and clear
All services are monitored remotely (there’s nothing to install and no code to embed). Outages are identified in real time. If an insured service crashes, the policy kicks in automatically.

Payouts are fast and hassle free
There’s no tedious claims process. If an insured service crashes, businesses are indemnified within 15 business days, once they have signed a declaration of loss. They can spend the money to fix any damages as they see fit and don’t need to report back.
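
Put together, a parametric payout reduces to simple arithmetic: the hourly price the business set in advance, multiplied by the hours an insured service was down. The Python sketch below illustrates that idea only. The class and function names are hypothetical, the figures are made up, and real policies may carry terms that are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParametricPolicy:
    """Hypothetical policy: which services are covered, and at what price."""
    insured_services: frozenset
    price_per_hour: float  # amount the business chose per hour of downtime


def payout(policy: ParametricPolicy, service: str, outage_hours: float) -> float:
    """Amount owed for one outage; no proof of actual damages is involved."""
    if outage_hours < 0:
        raise ValueError("outage_hours cannot be negative")
    if service not in policy.insured_services:
        return 0.0  # outages of uninsured services do not trigger the policy
    return policy.price_per_hour * outage_hours


if __name__ == "__main__":
    # Made-up service names and price, for illustration only.
    policy = ParametricPolicy(
        insured_services=frozenset({"compute", "object-storage"}),
        price_per_hour=50_000.0,
    )
    print(payout(policy, "compute", 3))  # insured service, 3 hours down
    print(payout(policy, "dns", 3))      # service not on the policy
```

Because the only inputs are the pre-agreed price and the measured outage, the business can work out what it is owed as soon as the outage ends.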

Smooth sailing

Risk managers have a very clear mission: to identify and hedge any risk that can throw a company off course. Any distraction or interruption can profoundly impact a business, all the more so if it’s not addressed in advance.

Outages are no exception, but they’re often overlooked. Downtime often originates outside the organization, which may contribute to this blind spot. But businesses increasingly depend on third-party technology, and risk managers must understand that these risks can and should be addressed and managed.