Why Delta Lake?
Delta Lake turns ADLS Gen2 into something that behaves like a transactional database. It adds an ACID transaction layer, versioning, and schema control on top of Parquet files — all without giving up Spark's horizontal scaling.
Transaction log: the heart of Delta
Every Delta table has a _delta_log/ directory with JSON files recording each operation. This log is what enables ACID and time travel.
# View the full history of a table
from delta.tables import DeltaTable
delta_table = DeltaTable.forName(spark, "silver.orders")
delta_table.history().show(truncate=False)
Typical output:
version | timestamp | operation | operationParameters
--------|---------------------|-----------|--------------------
5 | 2026-04-01 08:00:00 | MERGE | {"predicate": "..."}
4 | 2026-03-31 08:00:00 | MERGE | {"predicate": "..."}
3 | 2026-03-30 08:00:00 | WRITE | {"mode": "Overwrite"}
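Under the hood, each commit is a numbered JSON file in _delta_log/ (e.g. 00000000000000000005.json) containing one action per line: commitInfo, add, remove. A minimal sketch of how such an entry is structured and parsed (the sample content below is illustrative, not a real log):

```python
import json

# Illustrative commit entry: real _delta_log files hold one JSON action per line
sample_commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "MERGE", "timestamp": 1743494400000}}),
    json.dumps({"add": {"path": "part-00000.snappy.parquet", "size": 1048576,
                        "dataChange": True}}),
    json.dumps({"remove": {"path": "part-00007.snappy.parquet", "dataChange": True}}),
])

# Parse the newline-delimited actions, as a log reader would
actions = [json.loads(line) for line in sample_commit.splitlines()]
operations = [next(iter(a)) for a in actions]
print(operations)  # ['commitInfo', 'add', 'remove']
```

Replaying these add/remove actions in order is exactly how Delta reconstructs the set of files that make up any version of the table.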
Time travel
Query previous versions of the table — essential for debugging, auditing and rollback.
# By version
df_v3 = spark.read.format("delta") \
.option("versionAsOf", 3) \
.table("silver.orders")
# By timestamp
df_yesterday = spark.read.format("delta") \
.option("timestampAsOf", "2026-03-31") \
.table("silver.orders")
# Rollback to a previous version
delta_table.restoreToVersion(4)
Warning: time travel only works while old files still exist. After VACUUM, old versions are deleted.
OPTIMIZE and Z-ORDER
The small files problem
Frequent incremental ingestions create many small files, degrading read performance. OPTIMIZE consolidates them.
-- Compact all files in the table
OPTIMIZE silver.orders;
-- With Z-ORDER to optimize queries with frequent filters
OPTIMIZE silver.orders ZORDER BY (customer_id, order_date);
Z-ORDER physically reorganizes data so records with similar values in the specified fields are stored close together, which dramatically improves data skipping on filter queries (far fewer files need to be read).
When to use Z-ORDER
- Fields frequently used in WHERE clauses
- High-cardinality fields (customer_id, product_id, date)
- Don't use it on low-cardinality fields (a boolean status, an enum with 3 values)
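The intuition behind Z-ordering can be sketched in pure Python: interleaving the bits of two keys produces a single sort key (a Morton / Z-order value) that keeps records close in both dimensions at once. This is illustrative only; Delta's actual implementation differs:

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into one Z-order (Morton) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return z

# Sorting by the interleaved key clusters points that are near in BOTH
# dimensions, so filtering on either column can skip most files
points = [(3, 7), (3, 6), (10, 2), (2, 7)]
print(sorted(points, key=lambda p: z_value(*p)))
# [(3, 6), (2, 7), (3, 7), (10, 2)]
```

Note how (10, 2) sorts far from the three points clustered around small x and large y: a range filter on either column prunes it in one step.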
Auto Optimize
Instead of running OPTIMIZE manually, enable it at the table level:
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
Or permanently on the table:
ALTER TABLE silver.orders
SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true'
);
VACUUM: storage management
VACUUM deletes data files no longer referenced by the transaction log. By default it retains the last 7 days of files (the retention period).
-- See what would be deleted (dry run)
VACUUM silver.orders DRY RUN;
-- Run vacuum (keeps 7 days by default)
VACUUM silver.orders;
-- Reduce retention (caution: breaks time travel beyond this period).
-- Retention below 168 hours also requires disabling the safety check
-- spark.databricks.delta.retentionDurationCheck.enabled
VACUUM silver.orders RETAIN 48 HOURS;
Production rule: set up a weekly VACUUM routine. Without it, storage cost grows indefinitely.
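One way to wire that routine is to generate the statements per table and run them from a scheduled job. A sketch, where the table list and retention are assumptions and each statement would be executed via spark.sql:

```python
def maintenance_statements(tables, retain_hours=168):
    """Build the weekly OPTIMIZE + VACUUM statements for a list of Delta tables."""
    stmts = []
    for t in tables:
        stmts.append(f"OPTIMIZE {t}")
        stmts.append(f"VACUUM {t} RETAIN {retain_hours} HOURS")
    return stmts

# Hypothetical table list; in the scheduled job: for s in stmts: spark.sql(s)
stmts = maintenance_statements(["silver.orders", "silver.customers"])
print(stmts)
```

Running OPTIMIZE immediately before VACUUM is a common pairing: compaction marks the old small files as removed, and the vacuum then reclaims them once they fall outside the retention window.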
Schema enforcement and evolution
Enforcement (default)
Delta automatically rejects writes with an incompatible schema:
# This will fail if df has extra columns or incompatible types
df.write.format("delta").mode("append").saveAsTable("silver.orders")
Schema evolution
To automatically accept new columns:
df.write \
.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.saveAsTable("silver.orders")
For overwrites with different schema:
df.write \
.format("delta") \
.mode("overwrite") \
.option("overwriteSchema", "true") \
.saveAsTable("silver.orders")
Change Data Feed
Enable Change Data Feed to capture inserts, updates and deletes incrementally. This is useful for feeding downstream layers without a full scan:
ALTER TABLE silver.orders
SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');
# Read only changes since version 5
changes = spark.read.format("delta") \
.option("readChangeFeed", "true") \
.option("startingVersion", 5) \
.table("silver.orders")
# _change_type: insert, update_preimage, update_postimage, delete
changes.filter("_change_type = 'update_postimage'").show()
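A downstream consumer typically keeps only the latest image per key. The reduction logic can be sketched in plain Python (the row shapes mirror the CDF columns; in practice you would express this as a MERGE driven by the changes DataFrame):

```python
def apply_changes(state: dict, changes: list) -> dict:
    """Fold a batch of Change Data Feed rows (ordered by commit version)
    into a key -> value dict: keep inserts/post-images, drop deletes."""
    for row in changes:
        key, ctype = row["order_id"], row["_change_type"]
        if ctype in ("insert", "update_postimage"):
            state[key] = row["status"]
        elif ctype == "delete":
            state.pop(key, None)
        # update_preimage carries the old values and is ignored here
    return state

batch = [
    {"order_id": 1, "status": "pending",   "_change_type": "insert"},
    {"order_id": 1, "status": "pending",   "_change_type": "update_preimage"},
    {"order_id": 1, "status": "confirmed", "_change_type": "update_postimage"},
    {"order_id": 2, "status": "cancelled", "_change_type": "insert"},
    {"order_id": 2, "status": "cancelled", "_change_type": "delete"},
]
print(apply_changes({}, batch))  # {1: 'confirmed'}
```

This is why ordering by version matters: an insert followed by a delete in the same window must net out to no row at all.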
Constraints and data quality
-- Disallow NULLs in order_id
ALTER TABLE silver.orders ADD CONSTRAINT nn_order_id CHECK (order_id IS NOT NULL);
-- Valid values for status
ALTER TABLE silver.orders ADD CONSTRAINT valid_status
CHECK (status IN ('pending', 'confirmed', 'shipped', 'delivered', 'cancelled'));
Constraints are checked on every write. Violations throw exceptions that stop the pipeline, which is exactly what you want — silent bad data is far worse than a failed job.
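If you want to fail fast before the write even starts, the same rules can be mirrored as a client-side pre-check. A sketch, where the constants simply restate the two CHECK constraints above:

```python
VALID_STATUSES = {"pending", "confirmed", "shipped", "delivered", "cancelled"}

def violations(rows):
    """Return (row_index, constraint_name) pairs mirroring nn_order_id
    and valid_status from the table definition."""
    errors = []
    for i, row in enumerate(rows):
        if row.get("order_id") is None:
            errors.append((i, "nn_order_id"))
        if row.get("status") not in VALID_STATUSES:
            errors.append((i, "valid_status"))
    return errors

bad = [{"order_id": None, "status": "teleported"},
       {"order_id": 42, "status": "shipped"}]
print(violations(bad))  # [(0, 'nn_order_id'), (0, 'valid_status')]
```

The table-level constraints remain the source of truth; a pre-check like this only gives faster, row-level feedback before Delta enforces the same rules on write.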