The simplest way to describe Databricks is as a commercial offering built around Apache Spark. If you are doing anything in the big data or ML world, you are likely using (or have used) some type of processing engine for large datasets. Spark is a distributed data processing engine built for speed, ease of use, and flexibility.
Databricks takes Spark to the next level by bundling in cluster and job automation. With its connectors, custom-built libraries, and other features, Databricks aims to be the “unified platform for massive scale data engineering and collaborative data science”.
Databricks can run on multiple cloud platforms; in this post, however, we will focus on Azure Databricks. You can find out more about Databricks by visiting https://databricks.com/product/unified-data-analytics-platform
From an Azure perspective, the following are the high-level building blocks for an Azure Databricks deployment.
Azure resources are the first component in your stack. You will likely want to create a resource group for appropriate permission and lifecycle management. Databricks is organized into workspaces, and, luckily, they can be created via ARM template. You can find the schema reference at https://docs.microsoft.com/en-us/azure/templates/microsoft.databricks/allversions
When deciding how many Databricks workspaces to create, you’ll need to consider the following:
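To make the ARM template idea concrete, here is a minimal sketch in Python that builds a template for a single workspace and prints it as JSON. The resource type comes from the schema reference above; the API version, the workspace name, the region, and the managed resource group ID are placeholder assumptions, so check them against the schema before deploying anything.

```python
import json


def build_workspace_template(workspace_name, location, managed_rg_id, sku="premium"):
    """Return a minimal ARM template (as a dict) for one Databricks workspace.

    All argument values are supplied by the caller; nothing here is looked up
    in Azure. The apiVersion below is an assumption, verify it in the schema.
    """
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "resources": [
            {
                "type": "Microsoft.Databricks/workspaces",
                "apiVersion": "2018-04-01",  # assumed version
                "name": workspace_name,
                "location": location,
                "sku": {"name": sku},
                "properties": {
                    # Resource group that Databricks manages on your behalf.
                    "managedResourceGroupId": managed_rg_id,
                },
            }
        ],
    }


# Hypothetical values, for illustration only.
template = build_workspace_template(
    workspace_name="dbw-analytics-dev",
    location="eastus",
    managed_rg_id="/subscriptions/<subscription-id>/resourceGroups/dbw-analytics-dev-managed",
)
print(json.dumps(template, indent=2))
```

The printed JSON can be saved to a file and deployed into your resource group with whichever ARM deployment tooling you already use.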
From a security perspective, you will likely want to target VNet-injected workspaces.
This feature allows you to deploy the clusters associated with your Databricks workspace in your own virtual network. This in turn lets you use other Azure components, such as network security groups, service endpoints, and private endpoints, to secure your overall big-data infrastructure.
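As a rough illustration of VNet injection, the sketch below builds the `parameters` object that would sit under `properties` on a `Microsoft.Databricks/workspaces` resource in an ARM template. The three parameter names are the ones I understand the workspace schema to use; the virtual network ID and subnet names are hypothetical, and the subnets are assumed to exist already and be delegated to Databricks.

```python
import json


def vnet_injection_parameters(vnet_id, public_subnet, private_subnet):
    """Return the workspace 'parameters' block for a VNet-injected deployment.

    vnet_id is the full resource ID of your own virtual network; the two
    subnet names must refer to subnets inside that network.
    """
    return {
        "customVirtualNetworkId": {"value": vnet_id},
        "customPublicSubnetName": {"value": public_subnet},
        "customPrivateSubnetName": {"value": private_subnet},
    }


# Hypothetical network details, for illustration only.
params = vnet_injection_parameters(
    vnet_id="/subscriptions/<subscription-id>/resourceGroups/rg-network"
            "/providers/Microsoft.Network/virtualNetworks/vnet-data",
    public_subnet="snet-databricks-public",
    private_subnet="snet-databricks-private",
)
print(json.dumps({"properties": {"parameters": params}}, indent=2))
```

Network security groups and endpoints are then attached to those subnets like any other Azure networking resource.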
From an Azure RBAC perspective, you’ll want to make sure only appropriate users have access to the resource. A couple of notes:
From a security perspective, you’ll want to limit the number of contributors on the resource itself. This is pretty typical security advice.
For the Databricks API part, Databricks provides a CLI tool to assist with configuration activities after the workspace has been created. You can read more about the current state of the API here.
Personally, I’m not a fan of the toolset as it is currently built. I built my own in PowerShell, where I could properly secure credentials, add convenience methods, and package functionality for downstream use. The CLI tool, at the time of writing, uses personal access tokens that need to be created in the Databricks workspace and are assigned to a particular user. There is a way to use AAD tokens, which is currently in some form of preview, so look for that in the future.
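For readers who have not called the API directly, here is a small Python sketch (not the PowerShell tooling mentioned above) of how a personal access token is attached to a request as a bearer token. The workspace URL is a placeholder, the `DATABRICKS_TOKEN` environment variable name is an assumption borrowed from the CLI's convention, and `/api/2.0/clusters/list` is used only as an example endpoint. The function builds the request without sending it.

```python
import os
import urllib.request


def build_api_request(host, token, path="/api/2.0/clusters/list"):
    """Build (but do not send) an authenticated Databricks REST API request."""
    url = host.rstrip("/") + path
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
    )


def token_from_env(var_name="DATABRICKS_TOKEN"):
    """Read the token from the environment so it never lands in source control."""
    token = os.environ.get(var_name)
    if not token:
        raise RuntimeError(f"{var_name} is not set")
    return token


# Example usage (placeholder host; uncomment once the variable is set):
# request = build_api_request("https://<your-workspace>.azuredatabricks.net", token_from_env())
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Keeping the token in an environment variable or a secret store, rather than in a script, addresses part of the credential-handling concern described above.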
Once you have things going, you’ll likely want to look at clusters, instance pools, and the workspace (directories) from a configuration standpoint.
The last component is Databricks API access. Databricks workspaces have their own concept of users and groups, and you can further assign those users and groups specific permissions in the workspace itself. The API sits outside of the Azure landscape, so you will also likely want to control who has access to the API (and from where). A couple of notes:
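As an example of what cluster configuration can look like when automated, the sketch below builds a JSON body of the kind the Clusters API accepts when creating a cluster. The field names follow my understanding of the Clusters API 2.0; the runtime version, node type, and pool ID are illustrative values only. A cluster either names a node type directly or draws its nodes from an instance pool, so the helper insists on exactly one of the two.

```python
import json


def build_cluster_spec(name, spark_version, num_workers,
                       node_type_id=None, instance_pool_id=None,
                       autotermination_minutes=30):
    """Return a cluster definition suitable for posting to the Clusters API."""
    # Exactly one of node_type_id / instance_pool_id must be provided.
    if (node_type_id is None) == (instance_pool_id is None):
        raise ValueError("specify exactly one of node_type_id or instance_pool_id")

    spec = {
        "cluster_name": name,
        "spark_version": spark_version,
        "num_workers": num_workers,
        # Shut idle clusters down automatically to control cost.
        "autotermination_minutes": autotermination_minutes,
    }
    if instance_pool_id is not None:
        spec["instance_pool_id"] = instance_pool_id
    else:
        spec["node_type_id"] = node_type_id
    return spec


# Illustrative values only; use a runtime and VM size available in your workspace.
spec = build_cluster_spec(
    name="etl-nightly",
    spark_version="7.3.x-scala2.12",
    num_workers=2,
    node_type_id="Standard_DS3_v2",
)
print(json.dumps(spec, indent=2))
```

Keeping definitions like this in source control makes it easier to recreate the same clusters across workspaces.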
In conclusion, while this isn’t an exhaustive blog post, I hope it has given you some understanding of the steps required to automate your Databricks setup. As with everything, there are architectural and security considerations that should factor into your design.
Shamir is a Microsoft Most Valuable Professional (MVP – Azure) and has extensive experience building solutions in the cloud, from strategy to deployment to automation.