
September 27, 2023

Implementing MLOps with Databricks

Author: Artur Haponik, CEO & Co-Founder

Reading time: 12 minutes


In this era of data-driven decision-making, machine learning (ML) has become increasingly important to organizations seeking a competitive advantage in their industries. According to a recent McKinsey survey, 56% of organizations use machine learning technology in at least one of their business operations. [1] In other words, more than half of organizations worldwide already rely on artificial intelligence (AI) and machine learning in some form, and adoption is only expected to grow in the coming years.

As machine learning and associated technologies continue to grow, so has the need to develop a systematic approach that ensures seamless and efficient collaboration, deployment, monitoring, and continuous improvement of models in production. This is where MLOps comes in. MLOps provides a framework that bridges the gap between machine learning and operations to enable organizations to deploy machine learning models effectively and efficiently.

In this context, Databricks, a unified and open analytics platform, has emerged as a valuable partner for implementing complex MLOps workflows. It allows organizations to manage the end-to-end machine learning lifecycle and ensure all their models are accurate, effective, and reliable in real-life environments.

This post will provide an in-depth review of the steps, strategies, and best practices that ensure the successful implementation of MLOps with Databricks.

What is MLOps?

MLOps refers to a combination of systematic processes, technologies, and best practices for operationalizing machine learning models in production. This paradigm supports collaboration and communication between data scientists and operations professionals. MLOps combines DevOps, DataOps, and ModelOps to make business operations more efficient, reliable, and secure. Its main components are code, data, and machine learning models.


Why implement MLOps with Databricks?

Implementing MLOps with Databricks offers teams a wide variety of benefits that contribute to more effective and efficient machine learning operations. These benefits include:

A unified platform

The biggest advantage of implementing MLOps with Databricks is that it provides a single platform where you can manage the three major components of machine learning operations (code, data, and models) with unified access control. This unified approach enhances collaboration between data scientists and operations professionals, minimizing the risks and delays associated with managing data and deploying machine learning models.

Scalability and performance

Databricks provides scalable clusters designed to accommodate data sets, computations, and machine learning models of virtually any size. This scalability is particularly important when training machine learning models on data processed at scale with distributed engines such as Apache Spark, Apache Hadoop, and Apache Flink. [2]

MLflow integration

MLflow is an open-source platform for managing machine learning workflows. It provides tools for tracking experiments, versioning models, packaging code, and deploying models, all of which help you implement MLOps strategies more effectively.
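MLflow's tracking API is compact enough to show in a few lines. The sketch below is a minimal example rather than a prescribed workflow: it trains a stand-in scikit-learn model and logs a parameter, a metric, and the model itself to the active MLflow experiment. The dataset and model choice are illustrative assumptions.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in dataset; in practice this would come from a feature table.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each run captures parameters, metrics, and artifacts for later comparison.
with mlflow.start_run(run_name="rf-baseline"):
    n_estimators = 100
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)

    accuracy = accuracy_score(y_test, model.predict(X_test))
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.sklearn.log_model(model, artifact_path="model")
```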

Lakehouse architecture benefits

By implementing MLOps practices with Databricks, you’re able to leverage the benefits of Data Lakehouse. [3] This open data management architecture combines the best features of a data warehouse and data lake. With the help of Data Lakehouse architecture, you’ll be able to use the same data source for your ML workflow without moving data from one data warehouse solution to another. This makes the implementation of MLOps practices more flexible, scalable, and cost-effective.

Simplified ML workflow

With the help of Databricks, you can easily eliminate silos, reduce ML workflow complexity and improve overall productivity. This is because data preparation, model deployment, and monitoring are all done on the same platform.

Automated deployment

One of the best things about Databricks is that it lets you automate machine learning workflows and model deployment using the REST API and a scripting language such as Python or Bash. Such a script can create a new workflow and add the necessary steps to it. Automating ML workflows and model deployment minimizes the chance of manual errors during deployment and leads to more consistent model releases.
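As a rough illustration (not an official recipe), the Python sketch below calls the Databricks Jobs API to create a simple single-task job. The workspace URL, token environment variable, notebook path, cluster settings, and job name are all placeholder assumptions.

```python
import os

import requests

# Assumed values: replace with your workspace URL, token, and notebook path.
host = "https://<your-workspace>.cloud.databricks.com"
token = os.environ["DATABRICKS_TOKEN"]

job_spec = {
    "name": "nightly-model-training",
    "tasks": [
        {
            "task_key": "train",
            "notebook_task": {"notebook_path": "/Repos/ml/project/train"},
            "new_cluster": {
                "spark_version": "13.3.x-cpu-ml-scala2.12",
                "node_type_id": "i3.xlarge",  # example AWS node type
                "num_workers": 2,
            },
        }
    ],
}

# Create the job via the Jobs API (2.1); the response contains the new job_id.
response = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
response.raise_for_status()
print("Created job:", response.json()["job_id"])
```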

Improved model training

With the help of the AutoML system, Databricks preprocesses the data, handling tasks such as feature engineering and normalization. This makes the data more suitable for model training and improves the accuracy of the resulting models.
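For context, launching an AutoML experiment from a notebook can be as short as the hedged sketch below; `training_df` and the `churn` label column are assumed to exist, and the metric and timeout are arbitrary illustrative choices.

```python
from databricks import automl

# Launch an AutoML classification experiment; Databricks handles preprocessing,
# model selection, and hyperparameter tuning, and logs every trial to MLflow.
summary = automl.classify(
    dataset=training_df,   # assumed: an existing Spark DataFrame
    target_col="churn",    # assumed label column
    primary_metric="f1",
    timeout_minutes=30,
)

# The summary points at the best trial's MLflow run and generated notebook.
print(summary.best_trial.mlflow_run_id)
```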

Supercharge your MLOps implementation with Databricks. Discover how our Databricks Deployment Services can streamline your journey towards data-driven success.

Best practices for implementing MLOps with Databricks

The successful implementation of MLOps with Databricks involves several best practices, including the following:

Create a separate environment for every stage

In Databricks, an execution environment is the place where machine learning models and data are created or consumed by code. Each execution environment typically consists of compute instances, their runtimes and libraries, and the jobs that run on them.

When implementing MLOps with Databricks, it is highly recommended to create separate environments for the different phases of machine learning code and model development. Additionally, each environment should include clearly defined transitions from one stage to another.

Access control and versioning

Access control and versioning are key components of any successful software operations process, including implementing MLOps with Databricks.

Databricks recommends the following steps:

  • Use Git for version control: This way, code can be prepared and developed within notebooks or in integrated development environments (IDEs). Use Databricks Repos to integrate with your Git provider and keep code synchronized with your Databricks workspaces.
  • Store data in a Lakehouse architecture: For optimal data management, store raw data and feature tables in Delta Lake format in your cloud account, with access controls limiting who can read and modify them (see the sketch after this list).
  • Manage ML models and model development using MLflow: With MLflow, you can track the model development process and save code snapshots, metrics, model parameters, and other descriptive metadata as you go.
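As a small, hedged illustration of the storage and access-control recommendation above, the sketch below writes raw data as a Delta table and restricts read access with a SQL GRANT. It assumes a Databricks notebook where `spark` is predefined, and the path, table, and group names are placeholders.

```python
# Assumed: running in a Databricks notebook where `spark` is available.
raw_df = spark.read.option("header", True).csv("/Volumes/main/raw/events/")  # placeholder path

# Persist raw data as a managed Delta table in the lakehouse.
raw_df.write.format("delta").mode("overwrite").saveAsTable("main.bronze.events")

# Limit who can read the table; the group name is a placeholder.
spark.sql("GRANT SELECT ON TABLE main.bronze.events TO `data-scientists`")
```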

Deploy code, not models

In most cases, Databricks advises users to deploy code instead of models from one environment to another. Doing this ensures that all the code in the machine learning development project undergoes the same review and testing processes. Additionally, it ensures that the final production version of the resulting model is trained using production code.

Steps for implementing MLOps with Databricks

Here is a step-by-step guide for implementing MLOps with Databricks:

MLOps workflow with Databricks (source: Databricks.com)

The development stage

The main idea behind the development stage is experimentation. Within this stage, data scientists and engineers work together to create machine learning pipelines. They also develop the necessary features and run various experiments to optimize the performance of the resulting model.

Some of the steps involved in the development stage include:

  • Data Preparation and Management

The first step in this process is usually to load the data into Databricks. Data loading can be done over Java Database Connectivity (JDBC) or by simply uploading a CSV file. Once the data is uploaded, you can create tables from it. In most cases, data scientists and engineers working in the development environment have read-only access to production data.

Alternatively, to meet data governance requirements, the development environment may only have access to a mirrored version of production data. In either case, data scientists have a separate development storage area where they develop and experiment with new features and data tables.
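Both loading paths mentioned above can be sketched briefly. The example below assumes a Databricks notebook (where `spark` and `dbutils` are predefined) and uses placeholder file paths, table names, and JDBC connection details.

```python
# Option 1: load an uploaded CSV file and register it as a Delta table.
csv_df = spark.read.option("header", True).option("inferSchema", True).csv(
    "/FileStore/uploads/customers.csv"  # placeholder upload path
)
csv_df.write.format("delta").mode("overwrite").saveAsTable("dev.raw.customers")

# Option 2: pull the same data over JDBC from an external database.
jdbc_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/sales")  # placeholder URL
    .option("dbtable", "public.customers")
    .option("user", "reader")
    .option("password", dbutils.secrets.get("jdbc", "password"))  # secret scope assumed
    .load()
)
```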

  • Exploratory Data Analysis (EDA)

Data professionals often engage in an interactive, iterative process of exploring and analyzing data, using tools such as Databricks SQL, AutoML, and dbutils.data.summarize. Because EDA is exploratory by nature, it does not necessarily need to be deployed to other execution environments.
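A quick, hedged sketch of these notebook-side EDA tools: `dbutils` and `display` are only available inside Databricks notebooks, and the table name is a placeholder.

```python
# Load a development table (placeholder name) and profile it interactively.
df = spark.table("dev.raw.customers")

# Summary statistics and distributions rendered in the notebook UI.
dbutils.data.summarize(df)

# Ad-hoc inspection; display() renders an interactive, sortable table.
display(df.describe())
```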

  • Code

All the code an ML system needs is usually stored in a code repository. From time to time, data professionals create new and updated ML pipelines in a development branch of the Git project. You can manage your code and model artifacts inside or outside Databricks and synchronize them with Databricks workspaces using Databricks Repos.

  • Update Feature Tables

When implementing MLOps with Databricks, you need to transform raw data into something useful that the resulting ML model can use to make better predictions. The model development pipeline in this project reads raw data and existing feature tables saved in the Feature Store. Data scientists use development feature tables to create prototype models.

Once the code has been promoted to the production stage, these changes update the production feature tables. [4] Most importantly, feature tables saved in the Feature Store can be reused by other team members to understand how they were built.

It’s important to note that feature tables can be managed separately from other machine learning pipelines, especially if they’re owned by different teams.
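To make the Feature Store step concrete, here is a minimal sketch of registering a development feature table with the Workspace Feature Store client. The table name, primary key, and the `compute_customer_features` helper are hypothetical.

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Hypothetical featurization step: aggregate raw data into one row per customer.
customer_features_df = compute_customer_features(spark.table("dev.raw.customers"))

# Register the result as a feature table so other pipelines and teams can reuse it.
fs.create_table(
    name="dev.features.customer_features",
    primary_keys=["customer_id"],
    df=customer_features_df,
    description="Per-customer aggregates used by the churn model prototype.",
)
```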

  • Model Training

Once you’re done with feature engineering, the next step is to train and tune your model, selecting an algorithm or architecture that achieves the desired accuracy. With Databricks AutoML, you can simply select a dataset from the Data tab and have the platform run model selection and hyperparameter tuning automatically, saving data scientists hours of effort that would otherwise be spent building baseline models from scratch.

Even better, the interface involves only a few dropdowns and is fully integrated with the feature tables and Delta tables created and stored in the Data tab. After you pick a prediction target and decide how to handle imputation, AutoML handles the final stages of model selection. Once a model has been selected, trained, and tuned, it needs to be tested for quality on held-out data, and the results are then logged to the MLflow tracking server.
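A hedged sketch of that hand-off: it assumes `summary` is the output of an earlier AutoML run, `test_pdf` is a held-out pandas DataFrame with a `churn` label column, and the best trial's model returns class labels.

```python
import mlflow
from sklearn.metrics import f1_score

# Load the model produced by the best AutoML trial (summary comes from automl.classify).
model_uri = f"runs:/{summary.best_trial.mlflow_run_id}/model"
model = mlflow.pyfunc.load_model(model_uri)

# Score held-out data (assumed pandas DataFrame; predictions assumed to be class labels).
predictions = model.predict(test_pdf.drop(columns=["churn"]))
holdout_f1 = f1_score(test_pdf["churn"], predictions)

# Record the held-out metric against the same run for later comparison.
with mlflow.start_run(run_id=summary.best_trial.mlflow_run_id):
    mlflow.log_metric("holdout_f1", holdout_f1)
```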

  • Commit Code

To advance the ML workflow toward production, data professionals are required to commit the code required for featurization, training, and other ML pipelines to the source control repository. This step marks the end of the development lifecycle.

The staging stage

The main focus of the staging stage is to thoroughly test the ML pipeline code and ensure it’s ready for production. This testing is done in an environment that closely resembles the production setup. All the project’s code is tested in this stage. This includes the code for model training as well as code for feature engineering pipelines.

MLOps staging diagram

Machine learning engineers are usually tasked with creating a Continuous Integration (CI) pipeline to run the unit and integration tests that take place in this stage. The end product of the staging stage is a release branch that triggers the CI/Continuous Delivery (CD) system to kick off the production stage.

Here are the steps involved in the staging stage:

  • Merge Request

The staging stage begins with a merge request from an ML engineer to the staging branch in source control. Once the request has been sent, a robust CI/CD process begins.

  • Unit and Integration Tests

Unit tests are executed within the CI pipeline to ensure that the individual components of the pipeline function properly. Afterward, end-to-end integration tests are executed to validate the machine learning workflows on Databricks. If both the unit and integration tests pass, the code changes are merged into the staging branch. If any tests fail, the CI/CD system notifies the user and posts the results on the merge request.
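As an illustration of the kind of unit test the CI pipeline might run, the sketch below checks a hypothetical featurization helper with pytest; the function and column names are assumptions, not part of the original project.

```python
import pandas as pd


# Hypothetical project code (e.g. pipelines/featurization.py).
def add_tenure_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["tenure_years"] = out["tenure_days"] / 365.25
    return out


# Unit test executed by pytest inside the CI pipeline.
def test_add_tenure_features_adds_expected_column():
    df = pd.DataFrame({"tenure_days": [0, 365, 730]})
    result = add_tenure_features(df)
    assert "tenure_years" in result.columns
    assert result["tenure_years"].iloc[0] == 0.0
```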

  • Create a Release Branch

Once all the tests have been performed and the ML engineers are confident in the updated ML pipelines, they will proceed to create a release branch. This triggers the CI/CD system to update the production jobs.

The production stage

ML engineers usually oversee the production environment where the ML pipelines are deployed to serve various applications. Major pipelines in the production stage involve steps such as:

The production stage diagram

  • Update Feature Tables

Once new production data becomes available, this pipeline ingests it and updates the feature tables saved in the Feature Store. The pipeline can run continuously as a streaming job, be scheduled as a batch job, or be triggered by specific events.
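A minimal sketch of such an update, reusing the hypothetical feature table and `compute_customer_features` helper from the development-stage example; `mode="merge"` upserts new keys and updates existing ones.

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# Recompute features from newly arrived production data (helper is hypothetical).
new_features_df = compute_customer_features(spark.table("prod.raw.customers"))

# Upsert into the production feature table; existing keys are updated, new keys added.
fs.write_table(
    name="prod.features.customer_features",
    df=new_features_df,
    mode="merge",
)
```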

  • Model Training

In the production stage, model training is either triggered by code changes or scheduled to train a new model using the latest production data. Once the training is complete, the trained models are registered in the MLflow Registry.
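A brief sketch of that registration step, assuming the training run's `run_id` is known and using a placeholder registry model name; the stage transition is shown as one common, but optional, follow-up.

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the model logged by the production training run (run_id assumed known).
model_version = mlflow.register_model(
    model_uri=f"runs:/{run_id}/model",
    name="churn_classifier",  # placeholder registry name
)

# Move the new version into Staging so the CD pipeline can pick it up for testing.
MlflowClient().transition_model_version_stage(
    name="churn_classifier",
    version=model_version.version,
    stage="Staging",
)
```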

  • Continuous Deployment (CD)

The registration of a newly trained model triggers the CD pipeline, which performs a series of tests to confirm the model’s suitability for deployment. These tests include compliance checks, performance evaluations, and A/B comparisons against the current production model.

  • Model Deployment

Once the tests pass, the model is deployed for serving or scoring. Online serving and batch/streaming scoring are the most common deployment modes.
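A hedged sketch of the batch-scoring mode, assuming a Databricks notebook with `spark` available; the registered model name, stage, and table names are placeholders.

```python
import mlflow.pyfunc

# Placeholder feature table; keep the key column out of the model inputs.
features_df = spark.table("prod.features.customer_features")
feature_cols = [c for c in features_df.columns if c != "customer_id"]

# Wrap the registered production model as a Spark UDF for distributed batch scoring.
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_classifier/Production")

# Score every row and keep the prediction alongside the customer key.
scored_df = features_df.withColumn("prediction", predict_udf(*feature_cols))
scored_df.write.format("delta").mode("overwrite").saveAsTable("prod.gold.churn_scores")
```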

  • Monitoring

You need to monitor input data and model predictions for statistical properties such as data drift, as well as model performance and other relevant metrics. Regardless of the deployment mode, you can log the model’s input queries and predictions to Delta tables.
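A small sketch of that logging step, appending the scored rows with a timestamp to a placeholder Delta table that monitoring jobs can query; `scored_df` is assumed to be the output of the batch-scoring sketch above.

```python
from pyspark.sql import functions as F

# Assumed: `scored_df` holds the latest batch of features plus predictions.
inference_log = scored_df.withColumn("scored_at", F.current_timestamp())

# Append to a monitoring table (placeholder name) that drift checks can query later.
inference_log.write.format("delta").mode("append").saveAsTable("prod.monitoring.inference_log")
```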

  • Retraining

The last step in implementing MLOps with Databricks is retraining models so they stay up to date as data and conditions change. Databricks accommodates both automatic and manual retraining approaches.

Final thoughts

Although MLOps is not a new concept, many organizations still spend countless hours training and tuning ML models. They also have to coordinate closely between ML engineers and data scientists to ensure they’re using the latest data and testing each model before deployment. This process has proven to be costly, time-consuming, and tedious.

Fortunately, with the help of Databricks, data professionals can now speed up the process of getting to a baseline model. They can also use the generated code to build a production model in the shortest time possible and collaborate seamlessly with ML engineers and data professionals using the built-in interfaces.

References

[1] McKinsey.com. The State of AI in 2021. URL: https://www.mckinsey.com/capabilities/quantumblack/our-insights/global-survey-the-state-of-ai-in-2021, Accessed August 30, 2023
[2] Analyticsindiamag.com. Alternatives to Apache Spark. URL: https://analyticsindiamag.com/top-8-alternatives-to-apache-spark/, Accessed August 30, 2023
[3] Databricks.com. Data Lakehouse. URL: https://www.databricks.com/product/data-lakehouse, Accessed August 30, 2023
[4] Databricks.com. Feature engineering in Workspace Feature Store. URL: https://docs.databricks.com/applications/machine-learning/feature-store/feature-tables.html, Accessed August 30, 2023



Category: Data Engineering