On September 28th, 2020, Microsoft released an update to Azure Active Directory (AAD) which caused worldwide service degradation. It affected authentication to Azure, customers using Azure Business-to-Consumer for their applications, and the use of various Azure components. You can read the full root cause analysis (RCA) of this incident here.
A couple of key points in the RCA:
It goes without saying that AAD is a core component of the Microsoft ecosystem, and that was true even before many Azure services began switching to AAD as the authentication mechanism for the data plane. Think about services such as Azure SQL, which allow AAD authentication side-by-side with traditional SQL authentication. In fact, Azure policy/compliance even recommends that AAD be used as the primary authentication mechanism, and recommends disabling SQL authentication.
There have been many posts already focused on the outage and its impact on customers. The goal of this post is to look at the public material as it relates to the Safe Deployment Practices (SDP) framework that Microsoft uses.
As per a recent blog post on the topic, the goals of the SDP framework are simple:
“Prevent broad outages associated to changes, fixes, and new features”
The SDP accomplishes this by effectively partitioning the production environment into multiple rings, forcing deployments to transit all the rings prior to broad deployment across the platform.
We’ve likely all been in a scenario where code (or configuration for those DevOps professionals) has caused some type of production outage. These outages occur despite our best efforts to provide adequate development and test environments to catch these issues prior to go-live. The truth is, there are only so many issues you can catch in lower environments, or by performing unit tests.
The SDP Framework effectively creates another set of environments within what we would typically call “production”. These rings, as the framework refers to them, represent sets of use cases and/or customers where changes are rolled out to first. The idea here is to allow for testing of the change against a production setting, while limiting the blast radius to only users or situations that can tolerate it.
From an Azure perspective, these rings are broken out by regions in which a particular service is deployed. The diagram above refers to a Canary region, where customers who have signed up for early access programs live. This is likely also where Microsoft internal operations live.
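To make the ring idea concrete for myself, here is a minimal sketch of a ring-by-ring rollout that halts as soon as a ring looks unhealthy. Everything in it is an assumption for illustration: the ring names, the region names, and the `deploy` and `is_healthy` callbacks are hypothetical, and none of this reflects how Microsoft actually implements the SDP.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Ring:
    name: str
    regions: List[str]


# Hypothetical rings and regions, ordered from smallest to largest blast radius.
EXAMPLE_RINGS = [
    Ring("canary", ["canary-region"]),
    Ring("pilot", ["region-a"]),
    Ring("broad", ["region-b", "region-c"]),
]


def roll_out(
    rings: List[Ring],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy ring by ring and stop at the first unhealthy ring.

    Returns the names of the rings that completed successfully.
    """
    completed = []
    for ring in rings:
        # Push the change to every region in the current ring.
        for region in ring.regions:
            deploy(region)
        # Do not move on unless every region in this ring is healthy.
        if not all(is_healthy(region) for region in ring.regions):
            break
        completed.append(ring.name)
    return completed
```

The point of the early `break` is that a bad change found in the pilot ring never reaches the broad ring, which is exactly the blast radius limit the framework is after.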
I think the concept of deployment rings in a production setting makes a lot of sense, but also likely requires a lot of engineering to make it work properly. Even when considering a traditional web application deployment, routing of users to appropriate rings based on tolerance/preference might take more than a smile. And then what happens when changes need to be made to the routing layer? Detailed planning of what a ring is as it relates to what you are deploying is a must.
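For a traditional web application, one way I could imagine doing that routing is a deterministic assignment of each user to a ring, honouring an early access preference first. This is only a sketch under my own assumptions: the ring names, the `opted_in_early` flag, and the `pilot_percent` parameter are all hypothetical.

```python
import hashlib


def assign_ring(user_id: str, opted_in_early: bool, pilot_percent: int = 5) -> str:
    """Pick a ring for a user; the same user always lands in the same ring."""
    # Users who signed up for early access go to the canary ring.
    if opted_in_early:
        return "canary"
    # Hash the user id into a stable bucket from 0 to 99.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    # A small, fixed slice of everyone else gets the change before the rest.
    return "pilot" if bucket < pilot_percent else "broad"
```

Using a hash rather than a random choice keeps a user's experience consistent between requests, and it avoids storing an assignment per user. It does nothing for the harder question of how to safely change the routing layer itself.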
In the video on the subject, Ed Glas talks a fair bit about early implementations of the SDP as it relates to Visual Studio Team Services (VSTS), now known as Azure DevOps. One area I found quite interesting was the discussion of the feedback systems in place to understand the health of a change. At the time the video was posted, those feedback mechanisms seemed quite crude, effectively relying on user feedback and a pre-determined waiting period.
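A gate built on those two signals, a waiting period plus user feedback, might look something like the sketch below, with an error rate check added as one example of a less crude signal. The field names and thresholds are my own assumptions, not values taken from the video.

```python
from dataclasses import dataclass


@dataclass
class RingStatus:
    hours_since_deploy: float
    requests: int
    errors: int
    user_reported_issues: int


def may_promote(
    status: RingStatus,
    bake_hours: float = 24.0,
    max_error_rate: float = 0.01,
    max_user_issues: int = 0,
) -> bool:
    """Decide whether a change may move from this ring to the next one."""
    # Pre-determined waiting period: never promote before it has elapsed.
    if status.hours_since_deploy < bake_hours:
        return False
    # No traffic means no signal, so treat the ring as not yet proven.
    if status.requests == 0:
        return False
    error_rate = status.errors / status.requests
    return (
        error_rate <= max_error_rate
        and status.user_reported_issues <= max_user_issues
    )
```

Treating a ring with no traffic as "not proven" is a deliberate choice here: a quiet ring that simply waits out the clock tells you very little about the health of the change.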
While I’m sure things have progressed since the video was created, I’d love to ask the following questions:
Most of my time these days is focused on the concept of DataOps. What I’d like to do is understand more about the SDP and how it relates to traditional deployments (like web apps or microservices), and then further understand how to apply it to the realm of data. Can scenarios such as streaming ingestion be tested/deployed/used according to the SDP framework? It would be interesting to explore that.
Shamir is a Microsoft Most Valuable Professional (MVP – Azure) and has extensive experience building solutions in the cloud, from strategy to deployment to automation.