Databricks is a modern open-source data lake platform that accelerates innovation across data science services, data engineering, and business analytics through collaborative workspaces. It can be understood as a fully managed Apache Spark service with computing and storage layers, both of which can be effortlessly scaled to match your needs. Let's take a closer look at Delta Lake on Databricks.
The computational part includes features and services such as integration with external data sources, BI tools, and ML lifecycle management tools, as well as shared workspaces for data exploration, data engineering, and job scheduling. The storage functionality includes Delta Lake, which is the focus of this article. In this way, Delta Lake on Databricks includes everything needed to bring together data from a wide variety of sources and drive business outcomes from it.
Delta Lake technology can be highly beneficial for companies operating in different industries. Try data science services to find out how your business can take advantage of it.
A traditional data lake comes with a number of common issues. Often they are not discovered until an analyst begins to query the data and finds out that there was a schema change or a duplicated job a few months ago. This results in additional effort from Data Engineers to figure out what exactly happened and then re-run all the pipelines for the impacted period of time. The solution to these challenges is Delta Lake – a new standard for building data lakes.
Delta Lake is a new open-source solution for building data lakes on top of the parquet file format. It introduces reliability and enhanced performance of data querying while remaining fully compatible with the Spark API. This makes it very easy to adopt: you simply change the format of a table or DataFrame from 'parquet' to 'delta', while all the underlying tables remain stored as parquet files in storage such as Azure Data Lake, Azure Blob Storage, or AWS S3.
Moreover, all the code can be written in Scala, Python, or Spark SQL, which makes the tool natively usable by a wider range of employees, from Data Scientists to Business Analysts.
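To illustrate how small the switch is, here is a minimal PySpark sketch; the storage paths are assumptions used purely for illustration.

```python
from pyspark.sql import SparkSession

# On Databricks the `spark` session already exists; this line is for running elsewhere
spark = SparkSession.builder.getOrCreate()

# Read an existing parquet dataset
df = spark.read.format("parquet").load("/mnt/raw/events")

# Write it back as a Delta table: the only change is the format string
df.write.format("delta").mode("overwrite").save("/mnt/delta/events")

# Reading the Delta table works the same way
events = spark.read.format("delta").load("/mnt/delta/events")
```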
Delta Lake solves reliability issues by storing parquet files together with a transaction log. The latter provides ACID transaction guarantees – Atomic, Consistent, Isolated, Durable. Every time we run a query that changes the state of a table (which is physically stored as parquet files), Delta Lake creates a new version of the data. We therefore keep multiple copies of historically changed data, which enables time travel – we can quickly check how the table looked at some point in the past.
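A hedged sketch of what time travel looks like in practice; the table path, version number, and timestamp below are illustrative assumptions.

```python
# Inspect the transaction log of a Delta table
spark.sql("DESCRIBE HISTORY delta.`/mnt/delta/events`").show()

# Read the table as it looked at an earlier version...
v3 = spark.read.format("delta").option("versionAsOf", 3).load("/mnt/delta/events")

# ...or as of a specific point in time
snapshot = (spark.read.format("delta")
            .option("timestampAsOf", "2024-01-01")
            .load("/mnt/delta/events"))
```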
It also provides schema enforcement – no data can be written to a path if it does not match the expected schema – as well as schema evolution which, in contrast to enforcement, allows the schema to change over time and writes all new and changed columns to the table without failing jobs.
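A short sketch of the difference between the two behaviours; `new_df` and the target path are assumptions for illustration.

```python
# Schema enforcement (default): this append would fail if new_df contains
# columns that the target Delta table does not have.
# new_df.write.format("delta").mode("append").save("/mnt/delta/events")

# Schema evolution: opt in per write so new columns are added to the table
(new_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/delta/events"))
```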
Delta Lake also makes it possible to stream data into and out of a Delta table while batch jobs run against it at the same time. Thanks to versioning and isolation, each job and user running a query on the data gets a consistent, isolated snapshot view of it.
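The sketch below mixes a streaming job and a batch query on the same Delta table; the paths and checkpoint location are assumptions.

```python
# A streaming job reads new data from one Delta table and appends it to another
stream = spark.readStream.format("delta").load("/mnt/delta/events")

query = (stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/delta/_checkpoints/events_mirror")
    .outputMode("append")
    .start("/mnt/delta/events_mirror"))

# Meanwhile, a batch query on the source table sees a consistent snapshot
batch_snapshot = spark.read.format("delta").load("/mnt/delta/events")
```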
Usually, the architecture design pattern of Delta Lake consists of several layers through which data flows: raw data lands in a Base layer and is progressively refined until it reaches a Final layer that serves BI reports and production workloads.
Such a multilayered approach also allows for elasticity: for the Base layer we can allow schema evolution to capture changed schemas at the initial stages, while a good practice is to enforce the schema at the Final layer to ensure the correctness of BI reports and production workloads and the successful running of other jobs that use that data. A rough sketch of this layering follows below.
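In this sketch the table paths, column names, and `raw_df` are illustrative assumptions; the point is the contrast between an evolving Base layer and a strictly shaped Final layer.

```python
# Base layer: land raw data and let the schema evolve with the sources
(raw_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/mnt/delta/base/orders"))

# Final layer: a curated table with a fixed schema for BI and production
final_df = (spark.read.format("delta")
            .load("/mnt/delta/base/orders")
            .select("order_id", "customer_id", "amount", "order_date"))

(final_df.write.format("delta")
    .mode("overwrite")
    .save("/mnt/delta/final/orders"))
```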
Databricks optimizes the performance of querying data in Delta Lake by applying indexing with so-called 'Z-ordering' – reordering data so that related values end up colocated in the same set of files, which speeds up operations that filter or join on those columns. Having indexed data and knowing its order, Delta Lake on Databricks can skip entire files while running a query against the data: it analyzes which files need to be opened and performs actions only on them.
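On Databricks this is done with the OPTIMIZE command; the table path and column choices here are assumptions.

```python
spark.sql("""
    OPTIMIZE delta.`/mnt/delta/final/orders`
    ZORDER BY (customer_id, order_date)
""")
```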
Additionally, Delta Lake has a built-in process for compacting smaller files into larger ones that match the optimal parquet file size for Spark, since large numbers of small files significantly degrade Spark jobs. Delta Lake on Databricks also caches query output on the cluster's SSDs, so when another similar query is run, it can reuse the cached results of the first one, significantly improving performance.
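As a rough sketch, running OPTIMIZE without a ZORDER clause performs plain compaction, and on Databricks the SSD-backed disk cache can be toggled per session; the table path is an assumption.

```python
# Compact small files into larger, optimally sized parquet files
spark.sql("OPTIMIZE delta.`/mnt/delta/final/orders`")

# Enable the Databricks disk cache for the current session
spark.conf.set("spark.databricks.io.cache.enabled", "true")
```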
To find out more about AI-driven tools, explore our blog section. If you are interested in implementing this solution in your company, contact us.