Databricks is a modern data lake platform, built around open-source technology, that accelerates innovation across data science, data engineering, and business analytics through collaborative workspaces. It can be understood as a fully managed Apache Spark service whose compute and storage layers scale easily with demand. Let’s take a closer look at Delta Lake on Databricks.
The compute layer includes integration with external data sources and BI tools, integration with ML lifecycle management tools, and shared workspaces for data exploration, data engineering, and job scheduling. The storage layer includes Delta Lake, which is the focus of this article. Together, these give Databricks everything needed to bring together data from a wide variety of sources and drive business outcomes from it.
Delta Lake technology can be highly beneficial for companies operating in different industries. Try data science consulting to find out how your business can take advantage of it.
Delta Lake Features
Common issues with traditional data lakes include:
- failed production jobs – these leave data in a corrupt state that requires time-consuming recovery. There is no way to roll a failed job back and try again.
- lack of schema enforcement – each time a data field changes or a new one appears, there is no easy way to evolve or enforce the schema.
- lack of consistency – it is difficult to understand the state of the data, prove the consistency of the data, and ensure there are no gaps or improper aggregations done historically.
Often these issues are not discovered until an analyst queries the data and finds out that there was a schema change or a duplicated job a few months ago. Data Engineers then have to spend additional effort figuring out what exactly happened and re-running all the pipelines for the affected period. Delta Lake, a new standard for building data lakes, offers a solution to these challenges.
Delta Lake is an open-source storage layer for building data lakes on top of the Parquet file format. It adds reliability and better query performance while staying fully compatible with the Spark API. Adopting it can therefore be as simple as changing the format of a table or DataFrame from ‘parquet’ to ‘delta’; the underlying data is still stored as Parquet files in storage such as Azure Data Lake, Azure Blob Storage, or AWS S3.
Moreover, all the code can be written in Scala, Python, or Spark SQL, which makes the tool natively usable by a wider range of employees, from Data Scientists to Business Analysts.
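To illustrate how small that change is, here is a minimal Python sketch. The helper names `save_table` and `load_table` are hypothetical and exist only for this example; the calls inside them are the standard Spark DataFrame writer and reader methods. The sketch assumes `df` is a Spark DataFrame and `spark` is a Spark session on a cluster where the Delta format is available, as it is on Databricks.

```python
def save_table(df, path, fmt="delta"):
    """Write a Spark DataFrame to `path` in the given format.

    Only the format string differs between a plain Parquet table
    ("parquet") and a Delta table ("delta").
    """
    df.write.format(fmt).mode("overwrite").save(path)


def load_table(spark, path, fmt="delta"):
    """Read the table back using the same format string."""
    return spark.read.format(fmt).load(path)
```

Calling `save_table(df, path, fmt="parquet")` would write the same data as a plain Parquet table, so the rest of a pipeline does not need to change when switching formats.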
Delta Lake solves reliability issues by storing Parquet files alongside a transaction log. The latter provides ACID transaction guarantees: atomicity, consistency, isolation, and durability. Every time we run a query that changes the state of a table (which is physically stored as a set of Parquet files), Delta Lake records a new version of the table. Earlier versions therefore remain available, which enables time travel: we can quickly check what the table looked like at some point in the past.
It also provides schema enforcement, so that no data can be written to a path unless it matches the expected schema, and schema evolution, the counterpart to enforcement, which allows the schema to change over time so that new and changed columns are written to the tables without failing jobs.
Delta Lake makes it possible to stream data into and out of a Delta table while batch jobs run against it at the same time. As a result of versioning and isolation, each job and user querying the data gets a consistent, isolated snapshot view of it.
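The idea of data files plus a transaction log can be sketched with a toy in-memory model. This is not Delta Lake's implementation: the class `ToyDeltaTable`, its methods, and the file naming are invented for illustration, and plain Python lists stand in for Parquet files. What it borrows from the real design is the principle that data files are never edited in place, and that each commit is a log entry listing files to add and files to remove.

```python
import json


class ToyDeltaTable:
    """Toy model: immutable data files plus an ordered commit log."""

    def __init__(self):
        self.files = {}  # file name -> list of rows (stands in for a Parquet file)
        self.log = []    # one JSON string per commit; list index == version number

    def commit(self, add=None, remove=()):
        """Record a new version: data files to add and file names to drop."""
        version = len(self.log)
        added = []
        for i, rows in enumerate(add or []):
            name = f"part-{version:05d}-{i}.parquet"
            self.files[name] = list(rows)
            added.append(name)
        # The change becomes visible only when this single log entry is appended.
        self.log.append(json.dumps({"add": added, "remove": list(remove)}))
        return version

    def snapshot(self, version=None):
        """Replay the log up to `version` (latest by default) and read live files."""
        if version is None:
            version = len(self.log) - 1
        live = []
        for entry in self.log[: version + 1]:
            action = json.loads(entry)
            live = [f for f in live if f not in action["remove"]] + action["add"]
        return [row for name in live for row in self.files[name]]


table = ToyDeltaTable()
v0 = table.commit(add=[[{"id": 1, "qty": 5}], [{"id": 2, "qty": 7}]])
# "Update" id=2 by replacing its file rather than editing it in place.
v1 = table.commit(add=[[{"id": 2, "qty": 9}]], remove=["part-00000-1.parquet"])

print(table.snapshot())    # latest version of the table
print(table.snapshot(v0))  # time travel to the first version
```

Because old files are kept and only the log decides which ones are live, reading an older version is just a matter of replaying fewer log entries.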
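The difference between enforcement and evolution can be shown with a small stand-alone sketch. The function `write_rows` and its dictionary-based schema are hypothetical and only model the behaviour described above; the `merge_schema` flag is named after Delta Lake's `mergeSchema` write option, but the logic here is a simplification, not the library's code.

```python
def write_rows(schema, rows, merge_schema=False):
    """Validate `rows` against `schema` (column name -> Python type).

    merge_schema=False: a row with an unknown column is rejected (enforcement).
    merge_schema=True: unknown columns are added to the schema (evolution).
    A value of the wrong type for a known column is always rejected.
    Returns the (possibly widened) schema; the input schema is not modified.
    """
    new_schema = dict(schema)
    for row in rows:
        for column, value in row.items():
            if column not in new_schema:
                if not merge_schema:
                    raise ValueError(f"unexpected column: {column}")
                new_schema[column] = type(value)
            elif not isinstance(value, new_schema[column]):
                raise TypeError(
                    f"column {column} expects {new_schema[column].__name__}, "
                    f"got {type(value).__name__}"
                )
    return new_schema


schema = {"id": int, "name": str}
new_rows = [{"id": 1, "name": "a", "country": "PL"}]

try:
    write_rows(schema, new_rows)  # enforcement: the write fails
except ValueError as err:
    print("rejected:", err)

evolved = write_rows(schema, new_rows, merge_schema=True)  # evolution
print(sorted(evolved))
```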
Delta Lake Architecture Design
Usually, the architecture design pattern of Delta Lake will consist of the following steps:
- Step 1: Set up streaming and batch jobs to load raw data into storage (e.g. AWS S3, Azure Data Lake Storage) in its original format.
- Step 2: Use Databricks to combine batch and streaming jobs and save the data in Delta format, thus creating a Base layer of Delta Lake data. Converting raw data to Delta format provides additional reliability and better performance in later operations.
- Step 3: Perform all required join, enrich, clean, and transform operations on the data to prepare it for use in ML models or BI tools. This is the Intermediate layer of Delta Lake.
- Step 4: Finalize the prepared data for production use: aggregate it, rename columns if required, and separate the production-ready data frames for BI tools, CSV export, real-time apps, or ML models. This is the Final layer of data, fully prepared to drive business outcomes. It is worth mentioning that if the output tables at this stage are very big, a direct connection from BI tools can be too slow. In that case, they will need to be loaded into a Data Warehouse with OLAP modeling.
This multilayer approach also allows for flexibility: for the Base layer we can allow schema evolution in order to capture changed schemas at the initial stages. However, it is good practice to enforce a schema at the Final layer to ensure the correctness of BI reports and the successful running of production and other jobs that use that data.
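The layering in steps 2 to 4 can be sketched as three plain Python functions. This is only an illustration under stated assumptions: lists of dictionaries stand in for Delta tables, and the function names, the `customers` lookup, and the column names are all invented for the example.

```python
from collections import defaultdict

FINAL_COLUMNS = ("country", "total_amount")  # schema enforced on the Final layer


def to_base(raw_batch, raw_stream):
    """Step 2: combine batch and streaming records into one Base layer."""
    return list(raw_batch) + list(raw_stream)


def to_intermediate(base, customers):
    """Step 3: clean (drop rows without an amount) and enrich via a lookup join."""
    cleaned = [r for r in base if r.get("amount") is not None]
    return [{**r, "country": customers.get(r["customer_id"], "unknown")} for r in cleaned]


def to_final(intermediate):
    """Step 4: aggregate per country and enforce the Final layer schema."""
    totals = defaultdict(float)
    for r in intermediate:
        totals[r["country"]] += r["amount"]
    rows = [{"country": c, "total_amount": t} for c, t in sorted(totals.items())]
    if any(tuple(row) != FINAL_COLUMNS for row in rows):
        raise ValueError("final layer schema mismatch")
    return rows


raw_batch = [{"customer_id": 1, "amount": 10.0}, {"customer_id": 2, "amount": None}]
raw_stream = [{"customer_id": 1, "amount": 5.0}, {"customer_id": 3, "amount": 2.5}]
customers = {1: "PL", 3: "DE"}

final = to_final(to_intermediate(to_base(raw_batch, raw_stream), customers))
print(final)
```

Keeping each layer as a separate, persisted table means a bug in the aggregation logic can be fixed by recomputing the Final layer from the Intermediate one, without reloading the raw data.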
New Opportunities with Databricks
Databricks optimizes query performance on Delta Lake with a technique called ‘Z-ordering’, which reorders data so that related values are co-located in the same set of files. Knowing how the data is laid out, Databricks can skip entire files when running a query: it first determines which files could contain matching rows and then reads only those.
Additionally, it has a built-in process for compacting small files into larger ones that are closer to the optimal Parquet file size for Spark, since large numbers of small files degrade Spark job performance considerably. Delta Lake on Databricks also caches data read by queries on the cluster’s local SSDs, so when a similar query runs later it can be served from the cache, which significantly improves performance.
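A toy sketch can show why ordering plus per-file statistics allows files to be skipped. It is not how Databricks implements the feature: the functions below are invented for illustration, rows are simple `(x, y)` integer pairs, and a Morton code (bit interleaving) is used as one common way to build a Z-order. On Databricks itself, Z-ordering is requested with the `OPTIMIZE ... ZORDER BY` SQL command rather than written by hand.

```python
def interleave_bits(x, y, bits=8):
    """Morton (Z-order) code: interleave the bits of two non-negative ints."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def write_files(rows, rows_per_file):
    """Sort rows by Z-order of (x, y), cut into files, keep min/max stats."""
    ordered = sorted(rows, key=lambda r: interleave_bits(r[0], r[1]))
    files = []
    for start in range(0, len(ordered), rows_per_file):
        chunk = ordered[start:start + rows_per_file]
        xs = [r[0] for r in chunk]
        ys = [r[1] for r in chunk]
        files.append({"rows": chunk, "x": (min(xs), max(xs)), "y": (min(ys), max(ys))})
    return files


def query(files, x, y):
    """Return matching rows and the number of files that had to be opened."""
    opened = 0
    hits = []
    for f in files:
        if not (f["x"][0] <= x <= f["x"][1] and f["y"][0] <= y <= f["y"][1]):
            continue  # the stats prove the value cannot be in this file: skip it
        opened += 1
        hits.extend(r for r in f["rows"] if r == (x, y))
    return hits, opened


rows = [(x, y) for x in range(8) for y in range(8)]
files = write_files(rows, rows_per_file=16)
hits, opened = query(files, x=5, y=2)
print(hits, f"opened {opened} of {len(files)} files")
```

In this small grid each file ends up covering one quadrant of the `(x, y)` space, so a point lookup on both columns needs to open only one of the four files.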
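The compaction idea can be sketched as a simple greedy grouping of small files. The function `plan_compaction` and the 128 MB target are assumptions made for this example only; the actual compaction strategy and target file size used by Databricks are not described in this article.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files so that each group is at most target_mb.

    Files already at or above the target are left alone. Returns a list of
    groups; each group with more than one file would be rewritten as one file.
    """
    groups, current, current_size = [], [], 0
    for size in sorted(s for s in file_sizes_mb if s < target_mb):
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


plan = plan_compaction([10, 20, 90, 40, 300, 5])
print(plan)
```

Here six input files become three: the four smallest are merged into one 75 MB file, while the 90 MB and 300 MB files are left as they are.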