Why Delta Lake?
Delta Lake turns ADLS Gen2 into something that behaves like a transactional database. It adds an ACID transaction layer, versioning, and schema control on top of Parquet files — all without giving up Spark's horizontal scaling.
Transaction log: the heart of Delta
Every Delta table has a _delta_log/ directory with JSON files recording each operation. This log is what enables ACID and time travel.
# View the full history of a table
from delta.tables import DeltaTable
delta_table = DeltaTable.forName(spark, "silver.orders")
delta_table.history().show(truncate=False)
Typical output:
version | timestamp | operation | operationParameters
--------|---------------------|-----------|--------------------
5 | 2026-04-01 08:00:00 | MERGE | {"predicate": "..."}
4 | 2026-03-31 08:00:00 | MERGE | {"predicate": "..."}
3 | 2026-03-30 08:00:00 | WRITE | {"mode": "Overwrite"}
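Under the hood, each commit is a numbered JSON file in _delta_log/ (e.g. 00000000000000000005.json) containing one action per line: commitInfo, add, remove. A minimal sketch of how such an entry is structured and parsed (the sample content below is illustrative, not a real log):

```python
import json

# Illustrative commit entry: real _delta_log files hold one JSON action per line
sample_commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "MERGE", "timestamp": 1743494400000}}),
    json.dumps({"add": {"path": "part-00000.snappy.parquet", "size": 1048576,
                        "dataChange": True}}),
    json.dumps({"remove": {"path": "part-00007.snappy.parquet", "dataChange": True}}),
])

# Parse the newline-delimited actions, as a log reader would
actions = [json.loads(line) for line in sample_commit.splitlines()]
operations = [next(iter(a)) for a in actions]
print(operations)  # ['commitInfo', 'add', 'remove']
```

Replaying these add/remove actions in order is exactly how Delta reconstructs the set of files that make up any version of the table.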
Time travel
Query previous versions of the table — essential for debugging, auditing and rollback.
# By version
df_v3 = spark.read.format("delta") \
.option("versionAsOf", 3) \
.table("silver.orders")
# By timestamp
df_yesterday = spark.read.format("delta") \
.option("timestampAsOf", "2026-03-31") \
.table("silver.orders")
# Rollback to a previous version
delta_table.restoreToVersion(4)
Warning: time travel only works while old files still exist. After VACUUM, old versions are deleted.
OPTIMIZE and Z-ORDER
The small files problem
Frequent incremental ingestions create many small files, degrading read performance. OPTIMIZE consolidates them.
-- Compact all files in the table
OPTIMIZE silver.orders;
-- With Z-ORDER to optimize queries with frequent filters
OPTIMIZE silver.orders ZORDER BY (customer_id, order_date);
Z-ORDER physically reorganizes data so records with similar values in the specified fields are stored close together, which dramatically improves data skipping on filter queries (far fewer files need to be read).
When to use Z-ORDER
- Fields frequently used in WHERE clauses
- High-cardinality fields (customer_id, product_id, date)
- Don't use it on low-cardinality fields (a boolean status, an enum with 3 values)
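The intuition behind Z-ordering can be sketched in pure Python: interleaving the bits of two keys produces a single sort key (a Morton / Z-order value) that keeps records close in both dimensions at once. This is illustrative only; Delta's actual implementation differs:

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into one Z-order (Morton) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return z

# Sorting by the interleaved key clusters points that are near in BOTH
# dimensions, so filtering on either column can skip most files
points = [(3, 7), (3, 6), (10, 2), (2, 7)]
print(sorted(points, key=lambda p: z_value(*p)))
# [(3, 6), (2, 7), (3, 7), (10, 2)]
```

Note how (10, 2) sorts far from the three points clustered around small x and large y: a range filter on either column prunes it in one step.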
Auto Optimize
Instead of running OPTIMIZE manually, enable it at the table level:
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
Or permanently on the table:
ALTER TABLE silver.orders
SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true'
);
VACUUM: storage management
VACUUM deletes data files no longer referenced by the transaction log. By default it retains the last 7 days of files (the retention period).
-- See what would be deleted (dry run)
VACUUM silver.orders DRY RUN;
-- Run vacuum (keeps 7 days by default)
VACUUM silver.orders;
-- Reduce retention (caution: breaks time travel beyond this period).
-- Retention below 168 hours also requires disabling the safety check
-- spark.databricks.delta.retentionDurationCheck.enabled
VACUUM silver.orders RETAIN 48 HOURS;
Production rule: set up a weekly VACUUM routine. Without it, storage cost grows indefinitely.
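One way to wire that routine is to generate the statements per table and run them from a scheduled job. A sketch, where the table list and retention are assumptions and each statement would be executed via spark.sql:

```python
def maintenance_statements(tables, retain_hours=168):
    """Build the weekly OPTIMIZE + VACUUM statements for a list of Delta tables."""
    stmts = []
    for t in tables:
        stmts.append(f"OPTIMIZE {t}")
        stmts.append(f"VACUUM {t} RETAIN {retain_hours} HOURS")
    return stmts

# Hypothetical table list; in the scheduled job: for s in stmts: spark.sql(s)
stmts = maintenance_statements(["silver.orders", "silver.customers"])
print(stmts)
```

Running OPTIMIZE immediately before VACUUM is a common pairing: compaction marks the old small files as removed, and the vacuum then reclaims them once they fall outside the retention window.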
Schema enforcement and evolution
Enforcement (default)
Delta automatically rejects writes with an incompatible schema:
# This will fail if df has extra columns or incompatible types
df.write.format("delta").mode("append").saveAsTable("silver.orders")
Schema evolution
To automatically accept new columns:
df.write \
.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.saveAsTable("silver.orders")
For overwrites with different schema:
df.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.saveAsTable("silver.orders")
Change Data Feed
Enable Change Data Feed to capture inserts, updates and deletes incrementally. This is useful for feeding downstream layers without a full scan:
ALTER TABLE silver.orders
SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');
# Read only changes since version 5
changes = spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingVersion", 5) \
.table("silver.orders")
# _change_type: insert, update_preimage, update_postimage, delete
changes.filter("_change_type = 'update_postimage'").show()
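A downstream consumer typically keeps only the latest image per key. The reduction logic can be sketched in plain Python (the row shapes mirror the CDF columns; in practice you would express this as a MERGE driven by the changes DataFrame):

```python
def apply_changes(state: dict, changes: list) -> dict:
    """Fold a batch of Change Data Feed rows (ordered by commit version)
    into a key -> value dict: keep inserts/post-images, drop deletes."""
    for row in changes:
        key, ctype = row["order_id"], row["_change_type"]
        if ctype in ("insert", "update_postimage"):
            state[key] = row["status"]
        elif ctype == "delete":
            state.pop(key, None)
        # update_preimage carries the old values and is ignored here
    return state

batch = [
    {"order_id": 1, "status": "pending",   "_change_type": "insert"},
    {"order_id": 1, "status": "pending",   "_change_type": "update_preimage"},
    {"order_id": 1, "status": "confirmed", "_change_type": "update_postimage"},
    {"order_id": 2, "status": "cancelled", "_change_type": "insert"},
    {"order_id": 2, "status": "cancelled", "_change_type": "delete"},
]
print(apply_changes({}, batch))  # {1: 'confirmed'}
```

This is why ordering by version matters: an insert followed by a delete in the same window must net out to no row at all.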
Constraints and data quality
-- Disallow NULLs in order_id
ALTER TABLE silver.orders ADD CONSTRAINT nn_order_id CHECK (order_id IS NOT NULL);
-- Valid values for status
ALTER TABLE silver.orders ADD CONSTRAINT valid_status
CHECK (status IN ('pending', 'confirmed', 'shipped', 'delivered', 'cancelled'));
Constraints are checked on every write. Violations throw exceptions that stop the pipeline, which is exactly what you want — silent bad data is far worse than a failed job.
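If you want to fail fast before the write even starts, the same rules can be mirrored as a client-side pre-check. A sketch, where the constants simply restate the two CHECK constraints above:

```python
VALID_STATUSES = {"pending", "confirmed", "shipped", "delivered", "cancelled"}

def violations(rows):
    """Return (row_index, constraint_name) pairs mirroring nn_order_id
    and valid_status from the table definition."""
    errors = []
    for i, row in enumerate(rows):
        if row.get("order_id") is None:
            errors.append((i, "nn_order_id"))
        if row.get("status") not in VALID_STATUSES:
            errors.append((i, "valid_status"))
    return errors

bad = [{"order_id": None, "status": "teleported"},
       {"order_id": 42, "status": "shipped"}]
print(violations(bad))  # [(0, 'nn_order_id'), (0, 'valid_status')]
```

The table-level constraints remain the source of truth; a pre-check like this only gives faster, row-level feedback before Delta enforces the same rules on write.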