September 11, 2023

Cost optimization in Databricks through resource optimization

Author: Edwin Lisowski, CSO & Co-Founder

Reading time: 8 minutes


Over the years, Databricks has emerged as the go-to platform for organizations looking to process, analyze, and visualize large data volumes. According to a recent report by 6sense, over 9,731 organizations worldwide use Databricks as a big data analytics tool, with the United States accounting for about 46.19% of the platform’s total customers. [1] However, as with most cloud platforms, Databricks can be quite costly for some organizations. In fact, the cost of running a Databricks environment can easily hit six figures when left unmanaged and without the proper guardrails.

For that reason, cost optimization in Databricks through resource optimization is essential for managing large amounts of enterprise-grade data, analytics, and machine learning (ML) solutions. By optimizing Databricks resources, users can maximize the value of their investment while minimizing the costs of running the platform.

This post will provide an in-depth review of best practices and strategies that will help you achieve cost optimization in Databricks without sacrificing quality and performance.

Databricks pricing model explained

To effectively optimize costs in Databricks, you need to understand the platform’s pricing model and all the costs involved in running the platform. Databricks uses a consumption-based pricing model to charge users for the compute resources they consume on the platform. This means that the more you ‘consume,’ the more you pay. These resources include the scheduling and management of data processing workflows, compute management, data ingestion, data discovery, machine learning modeling, data annotation, security management, data exploration, source control, and many others.

Every time you use any of these Databricks resources to train machine learning models or run ETL pipelines, you consume computation power measured in Databricks Units (DBUs). In simple terms, a DBU measures the amount of computation power you use on Databricks per hour. Notably, billing on Databricks is based on per-second usage. To find out how much it will cost to run a Databricks environment, multiply the number of DBUs used by the dollar rate of each DBU. [2]
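As a rough illustration of this formula, the sketch below multiplies DBU consumption by a per-DBU rate. The rates and usage figures are hypothetical placeholders, not actual Databricks prices; check the pricing page [2] for real rates.

```python
# Rough cost estimate: DBUs consumed x dollar rate per DBU.
# The rates below are hypothetical placeholders, not real Databricks pricing.

DBU_RATES_USD = {
    "jobs_compute": 0.15,         # assumed $/DBU for Jobs Compute
    "all_purpose_compute": 0.40,  # assumed $/DBU for All-Purpose Compute
    "sql_compute": 0.22,          # assumed $/DBU for SQL Compute
}

def estimate_cost(compute_type: str, dbus_per_hour: float, hours: float) -> float:
    """Estimated spend = DBUs consumed * dollar rate per DBU."""
    rate = DBU_RATES_USD[compute_type]
    return dbus_per_hour * hours * rate

# e.g. an ETL job consuming 4 DBUs/hour for 100 hours on Jobs Compute:
cost = estimate_cost("jobs_compute", dbus_per_hour=4, hours=100)
print(f"${cost:.2f}")  # -> $60.00
```

The same arithmetic also makes it easy to compare configurations: the pricier all-purpose rate on the same workload would cost more than double the jobs-compute rate.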

There are several factors that determine how many DBUs it takes to run a Databricks environment, including the subscription plan tier, the amount of data it processes, how much memory it requires, location, compute type, and the cloud service used. Databricks supports several cloud service providers, including Google Cloud Platform, Amazon AWS, and Microsoft Azure. When it comes to compute type, the platform offers jobs compute, SQL compute, all-purpose compute, and even serverless compute.

Enhance your cost optimization strategies with our Databricks Deployment Services. Set up a Proof of Concept in days and achieve rapid insights with our support. 

Databricks free trial explained

In addition to paid options, Databricks offers a 14-day free trial to new users. During this free trial, you can access all of the platform’s features, including the ability to create clusters and use interactive notebooks to work with SQL, Apache Spark, MLflow, TensorFlow, Keras, Python, and Delta Lake. However, you must contact Databricks for custom configuration if you intend to deploy the service on a private cloud.

Although Databricks won’t charge for using the platform during the 14-day free trial, the underlying cloud infrastructure will. Notably, you can cancel your Databricks free trial at any time before it expires. Once the trial period elapses, you’ll need to upgrade to a paid plan to continue using Databricks resources.

For users who do not want to subscribe to Databricks’ paid plans, the platform offers a Community Edition. This version of the cloud-based platform is free and gives you access to a micro-cluster and a limited set of features. You can also share your notebooks with other users and host them free of charge.

Read more about Best practices for Databricks PoC (Proof of Concept)

Challenges associated with Databricks billing

Every billing cycle, you receive one invoice from Databricks and a separate one from your cloud service provider. While this split may seem straightforward and sensible to some Databricks users, it’s complex and tedious for others.

Beyond that, here are other challenges associated with Databricks billing:

Time and effort needed to integrate billing data

Running a Databricks environment involves integrating your Databricks billing data with your cloud spend. Unless you have sophisticated software to complete this process, combining both invoices will require time and manual effort. This process can also be error-prone, translating to inflated operational costs.

No spending guardrails

Automated cost control alerts help organizations spend wisely on data management and analytics. Unfortunately, Databricks lacks robust cost-alerting functions. As a result, it’s not unheard of for Databricks users to spend thousands or even tens of thousands of dollars on the platform without realizing it.

The risk of double charges

When using Databricks, you incur two main charges: the cost of licensing the platform itself and the cost of the underlying cloud infrastructure it runs on, such as Amazon EC2 instances. Because these charges arrive separately, it’s difficult to assess the overall cost of running a Databricks environment and its expected ROI.

Difficulty in tracking costs

Another challenge with Databricks billing is the difficulty in differentiating the costs related to various resources and capabilities, leading to inaccurate cost tracking. It’s also difficult to identify the specific business units driving your organization’s Databricks expenditure.

Best practices for Databricks cost optimization

The following are best practices for Databricks cost optimization to help you reduce unnecessary expenses on the cloud-based platform:

Use the DBU calculator

One of the best ways to achieve cost optimization in Databricks is by using the Databricks Unit (DBU) calculator. With the DBU calculator, you can easily estimate the costs of running workloads on Databricks and identify the key areas for cost optimization. This way, you can adjust cluster sizes, compute types, instance types, and other aspects to ensure you use the most cost-effective configuration available.

The DBU calculator also helps you understand how different factors affect the overall cost of running a Databricks environment. This allows you to make more informed decisions and minimize costs without sacrificing performance.

It might also be interesting for you: Mastering Databricks Deployment: A Step-by-Step Guide

Use the appropriate instance type

Choosing the right instance type for your workload on Databricks will ensure optimal performance and cost-efficiency on the cloud-based platform. Different instance types on Databricks are optimized for different workloads. Therefore, it’s highly recommended to choose an instance type that aligns with your respective workload characteristics.

For example, the Amazon EC2 M5 family instances are general-purpose instances, meaning they provide balanced compute, networking, and memory resources for a wide variety of use cases. [3] On the other hand, Amazon EC2 R5 family instances are memory-optimized, making them ideal for memory-intensive applications such as real-time big data analytics, in-memory databases, distributed web-scale in-memory caches, and high-performance databases. [4]
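A simple rule of thumb for matching workload profiles to EC2 instance families might be sketched as follows. The mapping is purely illustrative, not an official Databricks or AWS recommendation:

```python
# Illustrative mapping of workload profile to EC2 instance family.
# A rule-of-thumb sketch, not an official Databricks or AWS recommendation.

INSTANCE_FAMILY_BY_WORKLOAD = {
    "general": "m5",            # balanced compute, memory, and networking
    "memory_intensive": "r5",   # in-memory analytics, caches, large joins
    "compute_intensive": "c5",  # CPU-bound transformations
}

def suggest_instance_family(workload: str) -> str:
    """Fall back to general-purpose when the profile is unknown."""
    return INSTANCE_FAMILY_BY_WORKLOAD.get(workload, "m5")

print(suggest_instance_family("memory_intensive"))  # -> r5
```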

Enable autoscaling

Autoscaling is a feature on Databricks that enables clusters to scale up to process large amounts of data when needed and scale down when not in use. When you enable this feature, you specify the minimum and maximum number of workers for a given cluster, and Databricks chooses the exact number of workers needed to complete the job. With autoscaling, Databricks will automatically add more workers during the computationally demanding phases of your job and remove them when they’re no longer needed. [5]

This feature ensures you have the resources you need to handle various workloads without overspending. This is particularly crucial when handling workloads whose requirements change over time. In the long run, autoscaling helps optimize cluster utilization and minimize costs.
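In practice, autoscaling is configured by giving the cluster a worker range instead of a fixed size. Below is a minimal sketch of a cluster spec for the Databricks Clusters API; the cluster name, runtime version, node type, and worker counts are illustrative values, not recommendations.

```python
# Illustrative cluster spec for the Databricks Clusters API.
# Names, runtime version, node type, and worker counts are example values.

autoscaling_cluster = {
    "cluster_name": "etl-autoscaling",    # hypothetical name
    "spark_version": "13.3.x-scala2.12",  # example runtime version
    "node_type_id": "m5.xlarge",          # example instance type
    "autoscale": {
        "min_workers": 2,  # floor: workers kept even when load is light
        "max_workers": 8,  # ceiling: cap on scale-out during heavy phases
    },
}

fixed_size_cluster = {
    "cluster_name": "etl-fixed",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "m5.xlarge",
    "num_workers": 8,  # always pays for 8 workers, busy or not
}
```

Note that `autoscale` and `num_workers` are mutually exclusive in a cluster spec: a cluster either has a fixed size or a range, never both.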

Tag the clusters

Cluster tagging is a feature on Databricks that allows you to apply tags to clusters and pools at the beginning of a project. Applying cluster tags makes it easier for you to monitor the amount of resources used by different teams in your organization. This way, you can easily attribute resource usage to a specific department in the organization and identify areas where cost optimization is possible.

However, to ensure that cluster tags are used appropriately, the organization’s management and data engineers must come up with policies that enforce effective cluster tagging.
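Tags are attached through the `custom_tags` field of a cluster or pool spec. The sketch below shows example tags plus a toy aggregation of spend by tag value; the tag keys, values, and cost figures are placeholders your organization would standardize on.

```python
# Illustrative cluster spec fragment showing custom_tags for cost attribution.
# Tag keys, values, and cost figures are placeholders.

tagged_cluster = {
    "cluster_name": "analytics-team-cluster",
    "custom_tags": {
        "team": "marketing-analytics",  # which business unit owns the spend
        "project": "churn-model",       # which initiative the cluster serves
        "environment": "dev",           # dev / staging / prod
    },
}

def spend_by_tag(clusters, costs, tag_key):
    """Aggregate per-cluster costs by the value of one tag (toy example)."""
    totals = {}
    for cluster, cost in zip(clusters, costs):
        value = cluster.get("custom_tags", {}).get(tag_key, "untagged")
        totals[value] = totals.get(value, 0.0) + cost
    return totals

print(spend_by_tag([tagged_cluster], [123.45], "team"))
# -> {'marketing-analytics': 123.45}
```

An "untagged" bucket like the one above is a useful signal in itself: if it grows, the tagging policy is not being enforced.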

Take advantage of spot instances

Spot instances refer to instances that use spare EC2 capacity and can be purchased at a discounted price. Cloud service providers usually offer spot instances through a live marketplace, where availability is first-come, first-served and prices vary depending on supply and demand. By purchasing spot instances, you can save up to 90% on compute costs.

Before purchasing spot instances, it’s important to consider the risk of interruptions. Since spot instances are spare capacity, the cloud service provider can reclaim them if demand increases. This can result in costly delays and even failed jobs. The best way to mitigate this risk is through data replication and continuously monitoring instances for potential interruptions.
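On AWS, Databricks exposes spot usage through the `aws_attributes` block of the cluster spec. The sketch below combines spot workers with an on-demand fallback to mitigate the interruption risk described above; the specific values are illustrative, not recommendations.

```python
# Illustrative aws_attributes for a Databricks cluster using spot instances
# with an on-demand fallback; the values are examples, not recommendations.

spot_cluster = {
    "cluster_name": "spot-etl",
    "autoscale": {"min_workers": 2, "max_workers": 10},
    "aws_attributes": {
        # Fall back to on-demand capacity if spot instances are reclaimed.
        "availability": "SPOT_WITH_FALLBACK",
        # Keep the first node (the driver) on-demand so the cluster survives
        # spot reclamation of workers.
        "first_on_demand": 1,
        # Pay up to 100% of the on-demand price for spot capacity.
        "spot_bid_price_percent": 100,
    },
}
```

Keeping the driver on-demand while workers run on spot capacity is a common middle ground: worker loss degrades a job, but driver loss kills it.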

Final thoughts

There is no denying that Databricks is a reliable platform for processing and transforming large volumes of data. However, Databricks cost optimization is vital to ensure you’re not overspending on your projects. By implementing the above best practices, you can effectively minimize your expenses while maintaining the quality and agility of your workload.

References

[1] 6sense.com. Market Share of Databricks. URL: bit.ly/45Z0xYs. Accessed September 7, 2023
[2] Databricks.com. Databricks Pricing. URL: https://www.databricks.com/product/pricing. Accessed September 7, 2023
[3] Cloudzero.com. M5 Instance Types. URL: https://www.cloudzero.com/blog/m5-instance-types. Accessed September 7, 2023
[4] Aws.amazon.com. Amazon EC2 R5 Instances. URL: https://aws.amazon.com/ec2/instance-types/r5/
[5] Databricks.com. Benefits of Autoscaling. URL: https://docs.databricks.com/en/clusters/configure.html#benefits-of-autoscaling. Accessed September 7, 2023

Category: Data Engineering