Delta Lake and the Lakehouse Architecture

The technology world often sees upheavals when disparate concepts are combined to create something much more than the sum of its parts. Delta Lake is one such concept: by melding a write-ahead log with ACID guarantees, it brought transactions and atomicity to the ETL, analytics and big-data world, creating a revolution of sorts.

The problem(s):
Since the era of traditional data warehousing, analytics systems have been designed and modeled around denormalized tables, and kept separate from transactional systems. This started to change with the move to the cloud and the availability of more real-time data. With the advent of big-data technologies like HDFS/Hadoop, updating and storing relational datasets became even more constrained because of the performance cost. The difficulty was particularly acute for cloud customers, who faced additional latency compared to on-premises HDFS/Hadoop users. GDPR compliance made things worse: deleting or correcting the data of a few customers required massive table-wide rewrites, with a higher probability of data corruption and consistency issues if an update crashed midway.

There were a few common ways to work around these problems:

  • Partitioning relational data based on fields like dates, and using columnar data formats like Parquet or ORC, and
  • Using Apache Spark SQL, Presto or Hive to apply a SQL abstraction on the underlying datasets. However, these workarounds quickly became hard to maintain for complex updates and led to their own performance issues.

Databricks develops the Delta Lake
The solution came in the form of Delta Lake, an innovation from Databricks, which started offering it to its customers in 2017 and finally open-sourced it in 2019.
Delta Lake is an ACID table storage layer on top of cloud object stores, i.e. the data lake. It adds ACID transactions and schema-enforcement capabilities to data lakes.

  • The innovation was to maintain the information about which objects are part of a Delta table in an ACID manner, using a write-ahead log that is itself stored in the cloud data lake.
  • Each transaction is recorded as a new entry in the log, including metadata about the operation (create/read/update/delete) and the version of the data. For performance, the log is periodically compacted into checkpoints. This design eliminates the need to maintain a separate metastore; a sketch of reading the log follows the figure below.
  • Delta Lake supports the highest isolation level, serializable, so that concurrently executing transactions appear to execute serially.

Delta Lake table layout for the transaction log and data files
(Image adapted from Denny Lee)
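
To make the transaction log concrete, here is a minimal sketch that walks the `_delta_log` directory of a table and prints the actions recorded in each commit. The table path is a hypothetical local directory; in practice it would usually be an object-store URI such as s3://bucket/events.

```python
import json
from pathlib import Path

# Hypothetical local Delta table path (placeholder).
table_path = Path("/tmp/delta/events")
log_dir = table_path / "_delta_log"

# Each committed transaction is a JSON file named after its version,
# e.g. 00000000000000000000.json, 00000000000000000001.json, ...
for commit_file in sorted(log_dir.glob("*.json")):
    version = int(commit_file.stem)
    for line in commit_file.read_text().splitlines():
        action = json.loads(line)
        # Typical actions: add/remove data files, metaData, protocol, commitInfo.
        kind = next(iter(action))
        print(f"version {version}: {kind}")
```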

Key features

  • Updates and deletes - support for upserts (update + insert) and deletes. The Delta log is what makes these operations atomic and ACID-compliant (a short sketch follows this list).
  • Optimistic concurrency control - used to manage concurrent writes. Similar to an RDBMS, this maintains isolation between writers.
  • Time travel - versioned Parquet files enable data versioning, so you can query point-in-time snapshots ("time travel") and perform rollbacks.
  • Schema enforcement and evolution - old Parquet files can still be read, without rewriting them, when a table's schema changes.
  • File skipping - Delta Lake stores file-level metadata: the min/max values of each column for every file in the table. This enables a powerful optimization called file skipping, which can make some queries run 100 times faster!
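
As an illustration, the following PySpark sketch shows an upsert (MERGE) and a time-travel read. It assumes the delta-spark package is installed; the table and staging paths are placeholders.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Spark session configured for Delta Lake.
spark = (SparkSession.builder
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

target = DeltaTable.forPath(spark, "/tmp/delta/customers")   # placeholder path
updates = spark.read.parquet("/tmp/staging/customers")       # placeholder path

# Upsert (MERGE): update matching rows, insert the rest, in one atomic commit.
(target.alias("t")
       .merge(updates.alias("u"), "t.id = u.id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

# Time travel: read the table as it existed at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/customers")
```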

The Lakehouse
These features of Delta Lake have popularized the concept of the lakehouse, which combines some of the best features of data lakes and data warehouses. In essence, Delta Lake sits in between data lakes and data warehouses.

| Parameter | Data warehouse | Lakehouse | Data lake |
|---|---|---|---|
| Type of data | Structured data only | Both structured and unstructured data | Both structured and unstructured data |
| Storage | RDBMS, usually rows | Parquet, columns | Various formats: HDFS/Avro/Parquet |
| Schema enforcement | Strict schema-on-write | Schema-on-read, optional schema-on-write | Schema-on-read |
| Scalability | Complex partitioning | Can scale horizontally | Can scale horizontally |
| Concurrency | Not optimized for updates | Supports concurrent writes and ACID transactions | No transaction support, possible non-repeatable reads |
| ETL and CDC | Complex ETL and CDC | | |
| SCD* | Supported | Supported, e.g. with upserts | Not supported natively |

In the lakehouse architecture, data is ingested into the data lake using a schema-on-read approach. However, the Lakehouse also allows schema enforcement (schema-on-write) through ACID transactions and seamless schema evolution. This combination of schema-on-read and schema-on-write allows for a flexible and iterative approach to data exploration and analysis.
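
As a sketch of what schema enforcement and evolution look like in practice (reusing the hypothetical Spark session and placeholder paths from the earlier example):

```python
# New data that carries an extra column not present in the target table.
df_new = spark.read.json("/tmp/staging/events_with_extra_column")  # placeholder path

# Schema-on-write: by default Delta rejects an append whose schema does not
# match the table's schema, so bad data cannot silently corrupt the table.
# df_new.write.format("delta").mode("append").save("/tmp/delta/events")  # would fail

# Schema evolution: opting in with mergeSchema adds the new column to the
# table schema, while existing Parquet files remain readable as-is.
(df_new.write.format("delta")
       .mode("append")
       .option("mergeSchema", "true")
       .save("/tmp/delta/events"))
```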

Streaming data
Delta Lake provides a unified table format that can be used for both batch and streaming data processing.
Apache Kafka is often used to manage streaming pipelines or aggregate data in real time, which can add management complexity and duplicate data. Delta Lake supports exactly-once streaming writes and can, in many cases, obviate the need for a specialized real-time data processing system.
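
A minimal Structured Streaming sketch, assuming a Kafka topic named events and placeholder broker and paths; the checkpoint directory plus Delta's transactional commits are what provide the exactly-once guarantee:

```python
# Read a Kafka topic and continuously append it to a Delta table.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
          .option("subscribe", "events")                       # placeholder topic
          .load())

# checkpointLocation + Delta's atomic commits give exactly-once writes;
# the same table remains queryable by batch jobs while the stream runs.
query = (stream.selectExpr("CAST(value AS STRING) AS payload")
         .writeStream
         .format("delta")
         .option("checkpointLocation", "/tmp/delta/events/_checkpoints")
         .start("/tmp/delta/events"))
```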

Delta Lake and the lakehouse architecture bridge the gap between data lakes and data warehouses with their best-of-both-worlds features. With increasing cloud adoption, startups like Dremio as well as all the major cloud providers offer lakehouse-like services:

  • Amazon: AWS Redshift Spectrum
  • Microsoft: Azure Synapse + ADLS
  • Google: GCP BigQuery + BigLake

Open-source projects in the same vein include Apache Iceberg and Apache Hudi.
Integration and orchestration platforms like Apache Airflow and Apache NiFi also offer enhanced capabilities within the lakehouse.

While originally developed by Databricks, Delta Lake is now an open-source project of the Linux Foundation, with multiple contributors including Amazon, Talend, Microsoft, ByteDance and Alibaba.

*SCD: Slowly changing dimension

References:

  1. Armbrust et al., "Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores" (paper)
  2. Databricks open sources Delta Lake
  3. Delta Lake file skipping
  4. Understanding the Delta Lake transaction log at the file level