The Parametrix cloud monitoring system in action during the March 2021 Microsoft Azure outage. See how Parametrix uses the data it collects to analyse future events, inform modeling processes, and produce data-driven insights.
Microsoft Azure experienced errors in authentication operations for more than three hours, starting at 18:40 UTC on March 15, 2021, when our monitoring system began detecting high error rates on the Microsoft service. The incident was widely reported online because widely used Microsoft services such as Microsoft 365 and Microsoft Teams were down for several hours.
The incident seems to have been caused by the Azure Active Directory service reducing capacity for its authentication service and returning errors, which left it unable to authenticate users. It affected managed Microsoft applications, including Teams, Microsoft 365, Exchange and Xbox, all of which depend on Azure Active Directory, a Microsoft enterprise identity service that provides single sign-on and multi-factor authentication.
Customers that use Azure Active Directory directly in their production systems saw a partial impact on some services, such as Azure VMs and Azure Storage. Because the impact was limited to management features, such as the creation of new processes, it did not affect applications, processes or operations that were already running on Azure.
The Parametrix Monitoring System identified the outage as soon as it began. It monitored Azure’s error rate throughout the incident and recorded a peak error rate of over 75%, meaning that more than 75% of system management requests were failing, with the failures concentrated in authentication.
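To make the idea of a peak error rate concrete, here is a minimal sketch of how such a figure could be derived from individual request outcomes. It is not the Parametrix implementation: the function name `peak_error_rate`, the five-minute window, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def peak_error_rate(samples, window=timedelta(minutes=5)):
    """Return (window_start, error_rate) for the worst fixed-size window.

    `samples` is an iterable of (timestamp, succeeded) pairs, where each
    timestamp is a timezone-aware datetime. Hypothetical helper for
    illustration only. Returns None when there are no samples.
    """
    width = window.total_seconds()
    counts = defaultdict(lambda: [0, 0])  # window index -> [failures, total]
    for ts, succeeded in samples:
        index = int(ts.timestamp() // width)
        counts[index][1] += 1
        if not succeeded:
            counts[index][0] += 1
    if not counts:
        return None
    # Pick the window with the highest share of failed requests.
    index, (failed, total) = max(
        counts.items(), key=lambda item: item[1][0] / item[1][1]
    )
    start = datetime.fromtimestamp(index * width, tz=timezone.utc)
    return start, failed / total


# Synthetic data (not taken from the incident): one healthy window,
# followed by a window in which 3 of 4 requests fail.
t0 = datetime(2021, 1, 1, tzinfo=timezone.utc)
samples = (
    [(t0 + timedelta(seconds=10 * i), True) for i in range(4)]
    + [(t0 + timedelta(minutes=5, seconds=10 * i), i == 0) for i in range(4)]
)
start, rate = peak_error_rate(samples)
print(start.isoformat(), rate)
```

A rate of 0.75 in the worst window corresponds to the kind of statement made above: three out of every four requests in that window failed.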
Our system identified only a slight degradation of service in Azure Storage, since the service continued to function: the peak error rate was 10% per region and lasted only a few minutes. The failed operations themselves were not critical management ones such as the Azure Storage Account operations.
Although the SQL and Virtual Machines services showed high error rates, the errors were confined to the management side of each service, so instances that were already running did not suffer an outage.
The Parametrix Monitoring System recorded exactly which services were down at the moment the downtime occurred, and exactly what did or didn’t work on each service.
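The distinction between failing management requests and workloads that keep running can be sketched as a per-service breakdown. The example below is an assumption-laden illustration, not the system's actual logic: the `management` and `data` labels, the 0.5 threshold, the function names, and the records are all hypothetical.

```python
from collections import defaultdict


def error_rates(records):
    """Map (service, plane) to the share of failed requests.

    `records` is an iterable of (service, plane, succeeded) tuples, where
    `plane` is "management" or "data". Hypothetical schema for illustration.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for service, plane, succeeded in records:
        key = (service, plane)
        totals[key] += 1
        if not succeeded:
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}


def classify(rates, threshold=0.5):
    """Label each service; the threshold is an arbitrary illustrative value."""
    status = {}
    for service in {service for service, _ in rates}:
        data_rate = rates.get((service, "data"), 0.0)
        management_rate = rates.get((service, "management"), 0.0)
        if data_rate >= threshold:
            status[service] = "outage"
        elif management_rate >= threshold:
            status[service] = "management-only degradation"
        else:
            status[service] = "healthy"
    return status


# Synthetic records (not incident data).
records = (
    [("Virtual Machines", "management", False)] * 3
    + [("Virtual Machines", "management", True)]
    + [("Virtual Machines", "data", True)] * 4
    + [("Storage", "management", False)]
    + [("Storage", "management", True)] * 9
    + [("Storage", "data", True)] * 10
)
status = classify(error_rates(records))
print(status)
```

Separating the two kinds of request is what allows a report to say that a service had a high error rate without its running instances suffering an outage.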
Parametrix uses the data it collects to analyse future events, inform our modeling processes, and produce data-driven insights that influence strategy at the company and at the market level.
This incident and all the publicity surrounding it demonstrate the market’s ever-growing reliance on the cloud and the need for insurance policies that cover downtime caused by third-party IT providers of services such as cloud, ecommerce, payments and communications.