Architecture · 2026-01-07 · 5 min read

Schema evolution in Delta Lake: handling breaking changes without breaking pipelines

How Delta Lake handles schema changes, what mergeSchema and overwriteSchema actually do, and how to build pipelines that survive source schema drift.

Source schemas change. A new column gets added to the CRM. A field gets renamed in the ERP. A type changes from VARCHAR to TEXT. In a traditional data warehouse, schema changes are coordinated events that require downtime. In a Lakehouse, they happen without warning and your pipeline needs to handle them gracefully.

Delta Lake has built-in schema evolution support, but its behavior isn't always intuitive. Understanding exactly what it does — and doesn't — do for you is essential before you trust it in production.

Schema enforcement: Delta's default

By default, Delta Lake enforces schema on write. If you try to write a DataFrame that doesn't match the existing table schema, the operation fails:

# Table schema: id BIGINT, name STRING, email STRING
# New data has an extra column: phone STRING

df_with_phone.write \
    .format("delta") \
    .mode("append") \
    .saveAsTable("bronze.customers")

# ERROR: A schema mismatch detected when writing to the Delta table.
# To enable schema migration using DataFrameWriter or DataStreamWriter,
# please set: .option("mergeSchema", "true")

This is the right default. Silent schema changes are worse than loud failures — at least a failure tells you something changed.

mergeSchema: safe schema evolution

mergeSchema allows adding new columns, including new nested fields inside structs, plus a narrow set of type upcasts (NullType to any other type, and ByteType → ShortType → IntegerType). Broader numeric widening such as INT → BIGINT is not part of mergeSchema; that requires the separate type widening table feature in recent Delta releases. mergeSchema also does not allow removing columns, renaming columns, or narrowing types.

df_with_phone.write \
    .format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable("bronze.customers")

After this write, the table now has columns id, name, email, phone. Old rows will have NULL in the phone column. New rows will have the actual value.

You can also enable automatic schema evolution for the whole Spark session. Note that this flag is documented for schema evolution in MERGE INTO (and the DeltaTable merge API); plain appends are safest with the explicit per-write mergeSchema option:

spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
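With the flag set, a MERGE INTO whose source carries extra columns can evolve the target automatically. A hedged sketch, assuming an updates source view that contains the new phone column:

```sql
-- With autoMerge enabled, source-only columns (e.g. phone) are added
-- to the target schema as part of the merge.
MERGE INTO bronze.customers AS t
USING updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```

The UPDATE SET * / INSERT * shorthand is what triggers the evolution: columns present only in the source are appended to the target schema.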

What mergeSchema handles:

  • Adding new columns
  • Adding new nested fields to struct columns
  • A narrow set of upcasts (NullType → any type; ByteType → ShortType → IntegerType)

What mergeSchema does not handle:

  • Removing columns (the table keeps the column; rows written without it get NULL)
  • Renaming columns (treated as a drop + add)
  • General numeric widening (INT → BIGINT, FLOAT → DOUBLE) without the newer type widening table feature
  • Type narrowing (BIGINT → INT will fail)
  • Changing column types incompatibly (STRING → INT will fail)
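Because these failure modes are predictable, a pipeline can classify an incoming schema before attempting the write instead of letting Delta reject it. A plain-Python sketch; the upcast whitelist and the label names are illustrative, not a Delta API:

```python
# Hypothetical pre-flight check: compare an incoming schema against the
# table schema and classify the change before attempting a write.
# The allowed upcast pairs are illustrative; align them with your Delta version.
ALLOWED_UPCASTS = {("byte", "short"), ("byte", "int"), ("short", "int")}

def classify_schema_change(table_schema: dict, incoming_schema: dict) -> str:
    """Both arguments map column name -> type name (lowercase)."""
    added = set(incoming_schema) - set(table_schema)
    removed = set(table_schema) - set(incoming_schema)
    changed = {
        c for c in set(table_schema) & set(incoming_schema)
        if table_schema[c] != incoming_schema[c]
    }
    incompatible = {
        c for c in changed
        if (table_schema[c], incoming_schema[c]) not in ALLOWED_UPCASTS
    }
    if incompatible:
        return "incompatible"   # needs overwriteSchema or a migration
    if added or changed:
        return "mergeable"      # safe with mergeSchema
    if removed:
        return "subset"         # plain append works; missing columns become NULL
    return "identical"
```

Route "mergeable" writes through .option("mergeSchema", "true"), and alert on "incompatible" instead of retrying.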

overwriteSchema: the nuclear option

overwriteSchema completely replaces the table schema with the new DataFrame's schema. Use this for full-refresh writes when the schema has changed incompatibly:

df_new_schema.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .saveAsTable("bronze.customers")

This replaces both the data and the schema with the new DataFrame's. The Delta transaction log is preserved, so earlier versions, old schema included, remain readable via time travel until VACUUM removes their underlying files; the live table, however, no longer carries the previous schema.

Use overwriteSchema sparingly — only for full-refresh pipelines where you're replacing everything. Never on incremental pipelines.
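Because the transaction log survives the overwrite, the pre-overwrite state stays reachable through time travel (until VACUUM deletes the old files). A sketch; the version number 12 is illustrative:

```sql
-- Find the version just before the overwrite
DESCRIBE HISTORY bronze.customers;

-- Read the table as it was at that version, old schema and all
SELECT * FROM bronze.customers VERSION AS OF 12;
```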

Schema evolution for streaming pipelines

For structured streaming pipelines, schema evolution behavior differs. By default, a streaming query will fail if the source schema changes. You can configure it to handle new columns:

df_stream = (
    spark.readStream
         .format("delta")
         .option("readChangeFeed", "true")
         .option("startingVersion", "latest")
         .table("bronze.customers_raw")
)

# Enable schema evolution for the streaming write
query = df_stream.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/mnt/checkpoints/customers_silver") \
    .option("mergeSchema", "true") \
    .outputMode("append") \
    .toTable("silver.customers")

For streaming sources with frequent schema changes — Kafka JSON payloads, Auto Loader with schema inference — use Auto Loader's schema evolution mode:

df = (
    spark.readStream
         .format("cloudFiles")
         .option("cloudFiles.format", "json")
         .option("cloudFiles.schemaLocation", "/mnt/schema/customers")
         .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
         .load("abfss://landing@storage.dfs.core.windows.net/customers/")
)

Auto Loader supports four schema evolution modes:

  • addNewColumns (the default): the stream stops when new columns appear; on restart the updated schema includes them
  • rescue: the schema is never evolved; unexpected fields land in the _rescued_data column
  • failOnNewColumns: the stream stops on new columns and stays stopped until the schema is updated manually (strictest)
  • none: the schema is not evolved and new columns are ignored (most permissive, least safe)

Column mapping: renaming without data loss

Delta Lake Column Mapping (available since Delta 2.0) allows renaming and dropping columns without rewriting data. It maps logical column names to physical names in the Parquet files.

# Enable column mapping on the table
spark.sql("""
    ALTER TABLE silver.customers
    SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name')
""")

# Now you can rename columns without rewriting data
spark.sql("ALTER TABLE silver.customers RENAME COLUMN email TO email_address")

# Or drop a column without rewriting data
spark.sql("ALTER TABLE silver.customers DROP COLUMN legacy_field")

The Parquet files on disk still have the original column names. Delta's transaction log maps the logical name email_address to the physical column that is still called email. The rename is a metadata-only operation: no data rewrite required.

Enable column mapping on new tables from the start if you expect frequent schema changes:

spark.sql("""
    CREATE TABLE silver.customers (
        customer_id BIGINT,
        email STRING,
        status STRING
    )
    USING DELTA
    TBLPROPERTIES (
        'delta.columnMapping.mode' = 'name',
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5'
    )
""")

Building resilient pipelines

For Bronze ingestion, never drop columns you don't recognize. Capture them instead:

from pyspark.sql import functions as F

known_cols = ["customer_id", "email", "status", "updated_at"]

def rescue_unknown_cols(df, known_cols):
    unknown_cols = [c for c in df.columns if c not in known_cols]
    if not unknown_cols:
        return df.withColumn("_rescued_data", F.lit(None).cast("string"))
    rescue_struct = F.to_json(F.struct(*[F.col(c) for c in unknown_cols]))
    return (
        df.withColumn("_rescued_data", rescue_struct)
          .drop(*unknown_cols)
    )

bronze_df = rescue_unknown_cols(source_df, known_cols)

Unknown columns go into _rescued_data as a JSON string. Nothing is silently dropped. When your Silver transformation needs those columns, they're there.
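Promoting a rescued column later is just a JSON lookup. On the Spark side you would typically reach for F.get_json_object or F.from_json; the plain-Python equivalent below shows the logic (the row shape and names are illustrative):

```python
import json

def promote_rescued(row: dict, column: str):
    """Extract one column from a row's _rescued_data JSON string.

    Mirrors F.get_json_object(F.col("_rescued_data"), f"$.{column}")
    on the Spark side; the row dict here is illustrative.
    """
    rescued = row.get("_rescued_data")
    if not rescued:
        return None
    return json.loads(rescued).get(column)

row = {"customer_id": 1, "_rescued_data": '{"phone": "+49-30-1234"}'}
print(promote_rescued(row, "phone"))  # +49-30-1234
```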

Track table schema changes in version control as well. The Delta transaction log keeps schema history, but it's not designed for human review. A simple YAML file alongside your pipeline code is easier to reason about:

# schemas/customers.yaml
version: "3.0.0"
changelog:
  - version: "3.0.0"
    date: "2025-03-01"
    changes: "Added phone column, enabled column mapping"
  - version: "2.0.0"
    date: "2024-09-15"
    changes: "Renamed email to email_address"
  - version: "1.0.0"
    date: "2024-01-01"
    changes: "Initial schema"
columns:
  - name: customer_id
    type: bigint
    nullable: false
  - name: email_address
    type: string
    nullable: true
  - name: phone
    type: string
    nullable: true

This schema file is the contract between your pipeline code and the table. Any change to the table triggers a review and a version bump.
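To enforce that contract, a CI step can diff the YAML against the live table's schema and fail on drift. A minimal engine-agnostic sketch; in practice contract_cols would come from parsing the YAML (e.g. with PyYAML) and table_cols from spark.table(...).schema, both of which are assumptions here:

```python
def check_contract(contract_cols: list[dict], table_cols: list[dict]) -> list[str]:
    """Compare contract columns (from the YAML) against the live table.

    Each entry is {"name": ..., "type": ...}. In a real pipeline,
    table_cols would be derived from the Delta table's schema; it is
    passed in here so the check stays engine-agnostic.
    """
    problems = []
    contract = {c["name"]: c for c in contract_cols}
    table = {c["name"]: c for c in table_cols}
    for name in contract.keys() - table.keys():
        problems.append(f"missing column: {name}")
    for name in table.keys() - contract.keys():
        problems.append(f"undocumented column: {name}")
    for name in contract.keys() & table.keys():
        if contract[name]["type"] != table[name]["type"]:
            problems.append(f"type drift on {name}: "
                            f"{contract[name]['type']} != {table[name]['type']}")
    return sorted(problems)
```

An empty result means the table matches the contract; anything else should fail the pipeline run and trigger a review plus a version bump.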