We are continuing our discussion of DevOps and security concerns with Azure Databricks. In this post, we focus on user provisioning in Azure Databricks by way of the System for Cross-domain Identity Management (SCIM).
As a quick refresher, the Azure Databricks “service” is really just a multi-tenant application. The application itself is secured by Azure Active Directory, which means that it uses Azure Active Directory as its authentication provider. Once the authentication step is complete, the Databricks service looks at the claims issued by the authentication provider to determine which authorizations to grant the user. In the case of Databricks, the user in the claim is correlated with the Databricks databases to determine which workspaces the user is assigned to and which permissions that user has on each workspace. Permissions are a topic we will cover in a later blog post.
What this ultimately means is that a list of users and group memberships exists inside the Azure Databricks service, distinct and separate from any setup inside Azure Active Directory. So the question becomes: how do you build and maintain this configuration in Databricks, ideally drawing from your Azure Active Directory tenant? This is the purpose of SCIM integration.
Quoting the official standard (https://tools.ietf.org/html/rfc7643):
“The System for Cross-domain Identity Management (SCIM) specifications are designed to make identity management in cloud-based applications and services easier. The specification suite builds upon experience with existing schemas and deployments, placing specific emphasis on simplicity of development and integration, while applying existing authentication, authorization, and privacy models. Its intent is to reduce the cost and complexity of user management operations by providing a common user schema and extension model as well as binding documents to provide patterns for exchanging this schema using HTTP.”
Unlike authentication systems (such as SAML2 or WebSSO), the SCIM protocol and schema are focused on moving user and group data around cloud systems. Within the realm of SCIM, we talk about triggers, flows, and actors, all in relation to how authentication “data” can be moved and synchronized between cloud systems.
While a full discussion of SCIM is out of scope here, I’d like to point you to https://tools.ietf.org/html/rfc7642, which has a broad discussion of SCIM and its use cases. I’ll quote a bit here:
Create SCIM Identity Resource - Service On-boarding Trigger: A “create SCIM identity resource” trigger is a service on-boarding activity in which a business action such as a new hire or new service subscription is initiated by one of the SCIM Actors. In the protocol itself, service on-boarding may well be implemented via the same resource PUT method as a service change. This is particular to the implementation, and not to the use cases that drive that implementation.
Update SCIM Identity Resource - Service Change Trigger: An “update SCIM identity resource” trigger is a service change activity as a result of an identity moving or changing its service level. An “update SCIM identity” trigger might be the result of a change in a service subscription level or a change to key identity data used to denote a service subscription level. Password changes are specifically called out from other more general identity attribute changes as they are considered to have specific use-case differences.
Delete SCIM Identity Resource - Service Termination Trigger: A “delete SCIM identity resource” trigger represents a specific and deliberate action to remove an identity from a given SCIM service point. At this stage, it is unclear if the SCIM protocol needs to identify a separate protocol exchange for service suspension actions. This may be relevant as target services usually differentiate between these results and thus may require separate resource representations.
Single Sign-On (SSO) Trigger - Service Access Request: A “Single Sign-On” trigger is a special class of activity in which a Create or Update trigger is initiated during an SSO operational flow. The implication here is that, as the result of a service access request by the end user (SSO), defined SCIM protocol exchanges can be used to initiate SCIM resource CRUD operations somewhere in the service cloud.
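To make the create, update, and delete triggers quoted above a little more concrete, here is a minimal Python sketch of how they might map onto SCIM protocol requests. It assumes the HTTP verbs and schema URNs defined in the companion protocol spec (RFC 7644) and the core user schema (RFC 7643). The `ScimRequest` class and the three helper functions are hypothetical names invented for this illustration; nothing is sent over the network, and this is not how any particular provider implements its flows.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Schema URNs defined by the SCIM 2.0 specifications.
USER_SCHEMA = "urn:ietf:params:scim:schemas:core:2.0:User"
PATCH_SCHEMA = "urn:ietf:params:scim:api:messages:2.0:PatchOp"


@dataclass
class ScimRequest:
    """Hypothetical container describing a SCIM call without sending it."""
    method: str
    path: str
    body: Optional[str] = None


def create_user(user_name: str) -> ScimRequest:
    # Service on-boarding trigger: create a new identity resource.
    body = {"schemas": [USER_SCHEMA], "userName": user_name}
    return ScimRequest("POST", "/Users", json.dumps(body))


def update_user(user_id: str, active: bool) -> ScimRequest:
    # Service change trigger: modify one attribute of an existing identity.
    body = {
        "schemas": [PATCH_SCHEMA],
        "Operations": [{"op": "replace", "path": "active", "value": active}],
    }
    return ScimRequest("PATCH", f"/Users/{user_id}", json.dumps(body))


def delete_user(user_id: str) -> ScimRequest:
    # Service termination trigger: remove the identity from the service point.
    return ScimRequest("DELETE", f"/Users/{user_id}")


if __name__ == "__main__":
    # Placeholder identifiers, purely for illustration.
    for request in (
        create_user("new.hire@example.com"),
        update_user("123", active=False),
        delete_user("123"),
    ):
        print(request.method, request.path, request.body)
```

An SSO trigger would simply issue the create or update request at sign-in time rather than as part of a separate provisioning run.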
While Azure Databricks offers a SCIM API (https://docs.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/scim), we luckily don’t have to build our own provider and our own flows. Azure Active Directory has SCIM hooks that automatically take changes and synchronize them to your Databricks workspace. Detailed documentation can be found here:
A couple of important notes:
In conclusion, SCIM integration is an effective way to manage the users and groups in your Azure Databricks workspace. Even though you can’t sync permissions/entitlements, you can link Azure Databricks permissions to groups to make the cycle complete. Being tied to Azure Active Directory makes it easy to fit Databricks access provisioning into your other user onboarding/offboarding flows.
Shamir is a Microsoft Most Valuable Professional (MVP – Azure) and has extensive experience building solutions in the cloud, from strategy to deployment to automation.