Azure and Safe Deployment Practices

By Shamir Charania
on October 3, 2020

On September 28th, 2020, Microsoft released an update to Azure Active Directory (AAD) that caused worldwide service degradation. The impact covered authentication to Azure, customers using Azure AD Business-to-Consumer (B2C) for their applications, and the use of various Azure components. You can read the full root cause analysis (RCA) of this incident here.

A couple of key points in the RCA:

Telemetry was provided that detailed failures as a percentage of authentications by region
Users who had pre-authenticated were “less likely” to experience issues (suggesting that refresh token capability succeeded with a higher percentage than first-time login)
Microsoft claims that it detected the service degradation within minutes of initial impact
Microsoft identified a flaw in its Safe Deployment Practices (SDP) as the cause of the outage

It goes without saying that AAD is a core component of the Microsoft ecosystem. That was true even before many Azure services started switching to AAD as the authentication mechanism for the data plane. Think about services such as Azure SQL, which allow for AAD authentication side-by-side with traditional SQL authentication. In fact, Azure policy/compliance even recommends that AAD be used as the primary authentication mechanism, and recommends disabling SQL authentication.

There have been many posts already focused on the outage and impact to customers. The goal of this post is to look at the public material as it relates to the SDP framework that Microsoft uses.

What is the SDP Framework?

As per a recent blog post on the topic, the goals of the SDP framework are simple:

“Prevent broad outages associated to changes, fixes, and new features”

The SDP accomplishes this by effectively partitioning Microsoft's production environment into multiple rings, forcing deployments to transit all the rings prior to broad deployment across the platform.

Dev, Test, Prod and Beyond

We’ve likely all been in a scenario where code (or configuration for those DevOps professionals) has caused some type of production outage. These outages occur despite our best efforts to provide adequate development and test environments to catch these issues prior to go-live. The truth is, there are only so many issues you can catch in lower environments, or by performing unit tests.

The SDP Framework effectively creates another set of environments within what we would typically call “production”. These rings, as the framework refers to them, represent sets of use cases and/or customers where changes are rolled out to first. The idea here is to allow for testing of the change against a production setting, while limiting the blast radius to only users or situations that can tolerate it.

From an Azure perspective, these rings are broken out by regions in which a particular service is deployed. The diagram above refers to a Canary region, where customers who have signed up for early access programs live. This is likely also where Microsoft internal operations live.
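To make the ring idea a bit more concrete, here is a minimal sketch of a ring-by-ring rollout. This is not how Microsoft implements the SDP; the ring names, the region names, and the `deploy` and `is_healthy` callbacks are all hypothetical placeholders I made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Ring:
    name: str
    regions: List[str]


# Hypothetical ring layout: names and regions are illustrative only.
RINGS = [
    Ring("canary", ["canary-region"]),
    Ring("pilot", ["region-a"]),
    Ring("broad", ["region-b", "region-c"]),
]


def roll_out(
    rings: List[Ring],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to each ring in order and stop at the first unhealthy ring.

    Returns the names of the rings that completed successfully.
    """
    completed = []
    for ring in rings:
        for region in ring.regions:
            deploy(region)
        # Later rings are never touched if any region in this ring is unhealthy.
        if not all(is_healthy(region) for region in ring.regions):
            break
        completed.append(ring.name)
    return completed
```

The point of the sketch is the early `break`: a bad change that shows up in an early ring never reaches the regions in the later ones.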

I think the concept of deployment rings in a production setting makes a lot of sense, but also likely requires a lot of engineering to make it work properly. Even when considering a traditional web application deployment, routing of users to appropriate rings based on tolerance/preference might take more than a smile. And then what happens when changes need to be made to the routing layer? Detailed planning of what a ring is as it relates to what you are deploying is a must.
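As a rough illustration of what that routing could look like for a web application, here is a sketch that assigns users to rings. It assumes three made-up ring names, an explicit early-access opt-in flag, and a stable hash of the user ID to pick a small pilot population; none of this comes from Microsoft's material.

```python
import hashlib


def assign_ring(user_id: str, opted_in_early: bool, pilot_percent: int = 10) -> str:
    """Pick a deployment ring for a user (ring names are hypothetical)."""
    # Users who explicitly signed up for early access always get the first ring.
    if opted_in_early:
        return "canary"
    # A stable hash keeps each user in the same ring across requests.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "pilot" if bucket < pilot_percent else "broad"
```

Even in this toy form, the routing function is itself code that has to be deployed somewhere, which is exactly the chicken-and-egg problem mentioned above.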

Reporting and Signal Tracking

In the video on the subject, Ed Glas talks a fair bit about early implementations of the SDP as it relates to Visual Studio Team Services (VSTS), now known as Azure DevOps. One of the areas that I found quite interesting was the discussion of the feedback systems in place to understand the health of a change. At the time the video was posted, those mechanisms seemed quite crude, effectively relying on user feedback and a pre-determined waiting period.

While I’m sure things have progressed since the video was created, I’d love to ask the following questions:

How are feedback signal requirements determined for any given change, and in particular, how are they worked into features that may be behind feature flags?
Does the process now make use of automated decision makers, presumably looking at the feedback signals, to determine if a change is “ready to go” to the next ring?
Do/Can health reporting dashboards, etc., automatically change to display key feedback signals as they relate to a given deployment?

Conclusion

Most of my time these days is focused on the concept of DataOps. I think what I’d like to do is understand more about the SDP and how it relates to traditional deployments (like web apps or microservices) and then further understand how to apply it to the realm of data. Can scenarios such as streaming ingestion be tested/deployed/used according to the SDP framework? It would be interesting to explore that.

About Shamir Charania

Shamir Charania, a seasoned cloud expert, possesses in-depth expertise in Amazon Web Services (AWS) and Microsoft Azure, complemented by his six-year tenure as a Microsoft MVP in Azure. At Keep Secure, Shamir provides strategic cloud guidance, from senior architecture-level decision-making to having the technical chops to back it all up. With a strong emphasis on cybersecurity, he develops robust global cloud strategies prioritizing data protection and resilience. Leveraging complexity theory, Shamir delivers innovative and elegant solutions to address complex requirements while driving business growth, positioning himself as a driving force in cloud transformation for organizations in the digital age.