Architecture · 2026-04-02 · 4 min read

How to set up your Bronze layer in Databricks

A practical guide to building a reliable Bronze layer in Databricks using AutoLoader, Delta Lake, and Unity Catalog — with real production patterns.

What is the Bronze layer?

The Bronze layer is the raw landing zone of your Lakehouse. It stores data exactly as it arrives from source systems — no transformations, no business rules, no filtering. Its only job is to persist the raw signal reliably.

Think of Bronze as your audit log. When something breaks downstream, Bronze is where you come to replay.

Core principles

  • Never delete data from Bronze — append-only by default
  • Preserve source fidelity — store data as-is, including nulls, bad formats, and duplicates
  • Add ingestion metadata — timestamp, source file, batch ID
  • Schema-on-read — let downstream layers enforce schema
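
The principles above can be sketched in plain Python. This is a hypothetical helper, not part of the actual pipeline: the payload passes through untouched (strings stay strings, nulls stay nulls, duplicates survive) and only the metadata fields are added.

```python
from datetime import datetime, timezone

def to_bronze_record(raw: dict, source_file: str, batch_id: str) -> dict:
    """Illustrative only: preserve the payload as-is and attach
    the three ingestion-metadata fields from the bullet list above."""
    # Schema-on-read: values are kept as raw strings, nulls preserved
    payload = {k: None if v is None else str(v) for k, v in raw.items()}
    return {
        **payload,
        "_ingested_at": datetime.now(timezone.utc).isoformat(),
        "_source_file": source_file,
        "_batch_id": batch_id,
    }
```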

Folder and catalog structure

Before writing any code, define your structure in Unity Catalog:

-- Create the Bronze catalog and schema
CREATE CATALOG IF NOT EXISTS catalog_prod;
CREATE SCHEMA IF NOT EXISTS catalog_prod.bronze;

Recommended storage layout on ADLS Gen2:

/lakehouse/
  bronze/
    orders/
      _checkpoint/
      year=2026/month=04/day=10/
    customers/
      _checkpoint/
    events/
      _checkpoint/
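
If you generate these paths in code, a small helper keeps the layout consistent. A sketch (the helper name and default root are assumptions, matching the layout above):

```python
from datetime import date

def bronze_path(table: str, d: date, root: str = "/lakehouse/bronze") -> str:
    # Builds the year=/month=/day= layout shown above for a given table and date
    return f"{root}/{table}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"
```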

Ingesting with AutoLoader

AutoLoader is the recommended way to ingest files into Bronze. It handles incremental ingestion natively — only new files are processed on each run.

# Bronze ingestion with AutoLoader
df_raw = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/lakehouse/bronze/orders/_schema")
    .option("cloudFiles.inferColumnTypes", "false")   # keep everything as string
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .load("abfss://landing@<storage>.dfs.core.windows.net/orders/")
)

# Add ingestion metadata
from pyspark.sql import functions as F

df_bronze = df_raw.withColumns({
    "_ingested_at": F.current_timestamp(),
    "_source_file": F.col("_metadata.file_path"),
    "_batch_id":    F.col("_metadata.file_modification_time").cast("string"),
})

# Write to Delta
(
    df_bronze.writeStream
    .format("delta")
    .option("checkpointLocation", "/lakehouse/bronze/orders/_checkpoint")
    .option("mergeSchema", "true")
    .outputMode("append")
    .trigger(availableNow=True)   # batch mode — process all pending files and stop
    .toTable("catalog_prod.bronze.orders")
)

The availableNow=True trigger is what you want for scheduled jobs: it processes all pending files and exits cleanly, unlike the deprecated trigger(once=True), which crammed everything into a single batch and ignored rate-limiting options such as maxFilesPerTrigger.

Schema evolution

Source systems change. A column gets added, renamed, or its type flips. AutoLoader with cloudFiles.schemaEvolutionMode = addNewColumns evolves the schema automatically when new columns appear. To freeze the schema entirely and capture anything unexpected instead, use rescue mode:

.option("cloudFiles.schemaEvolutionMode", "rescue")

In rescue mode the schema never evolves; values that do not match it land in a _rescued_data column instead of killing the job. You can inspect them later:

SELECT _rescued_data, _source_file, _ingested_at
FROM catalog_prod.bronze.orders
WHERE _rescued_data IS NOT NULL
ORDER BY _ingested_at DESC
LIMIT 100;

Handling CDC events (Debezium / Kafka)

If your source is a CDC stream (e.g., Debezium via Azure Event Hubs), Bronze still stores everything raw:

from pyspark.sql.types import StringType

df_cdc_raw = (
    spark.readStream
    .format("eventhubs")
    .options(**eventhubs_conf)
    .load()
    .withColumn("body", F.col("body").cast(StringType()))
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_partition", F.col("partition"))
    .withColumn("_offset", F.col("offset"))
)

(
    df_cdc_raw.writeStream
    .format("delta")
    .option("checkpointLocation", "/lakehouse/bronze/orders_cdc/_checkpoint")
    .outputMode("append")
    .trigger(processingTime="60 seconds")
    .toTable("catalog_prod.bronze.orders_cdc")
)

Deserialization of the CDC envelope happens in Silver, not here.
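
To see why that separation matters, here is a simplified Debezium change event (field names follow the standard Debezium envelope; the values are made up). Bronze keeps the whole body as one string; Silver is where op, before, and after get unpacked:

```python
import json

# Simplified Debezium envelope; Bronze stores `body` as a single raw string
body = json.dumps({
    "op": "u",                                        # c=create, u=update, d=delete
    "before": {"order_id": 42, "status": "pending"},
    "after":  {"order_id": 42, "status": "shipped"},
    "ts_ms": 1760000000000,
})

# Silver-side unpacking, deliberately absent from the Bronze job above
envelope = json.loads(body)
row_after = envelope["after"]
```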

Table properties for Bronze

Set these properties on your Bronze tables (shown here with ALTER TABLE; they can equally go in the CREATE TABLE statement) to optimize for append-heavy workloads:

ALTER TABLE catalog_prod.bronze.orders
SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true',
    'delta.enableChangeDataFeed'       = 'false',  -- not needed on Bronze
    'delta.logRetentionDuration'       = 'interval 90 days',
    'delta.deletedFileRetentionDuration' = 'interval 7 days'
);

Partitioning

For high-volume Bronze tables, partition by ingestion date to limit scan scope when replaying:

(
    df_bronze
    .withColumn("_ingestion_date", F.to_date("_ingested_at"))   # derive the partition column
    .writeStream
    .format("delta")
    .partitionBy("_ingestion_date")
    .option("checkpointLocation", "/lakehouse/bronze/orders/_checkpoint")
    .outputMode("append")
    .toTable("catalog_prod.bronze.orders")
)

Avoid over-partitioning. If daily volume is under 1 GB, partitioning by date creates more overhead than it saves.
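
A rough way to sanity-check that threshold, assuming optimizeWrite targets files of around 128 MB (the target size is an assumption here, not something this article configures):

```python
def files_per_daily_partition(daily_gb: float, target_file_mb: int = 128) -> int:
    # Approximate file count in one date partition; below ~1 GB/day a
    # partition holds only a handful of files, so pruning buys little
    return max(1, round(daily_gb * 1024 / target_file_mb))
```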

Unity Catalog tags and comments

Document Bronze tables so everyone knows what they contain:

COMMENT ON TABLE catalog_prod.bronze.orders IS
    'Raw orders from SQL Server via Debezium CDC. Append-only. No transformations applied.';

ALTER TABLE catalog_prod.bronze.orders
ALTER COLUMN _ingested_at COMMENT 'Timestamp when the record was written to Bronze';

ALTER TABLE catalog_prod.bronze.orders
ALTER COLUMN _source_file COMMENT 'Source file path from ADLS Gen2 landing zone';

Monitoring Bronze health

A Bronze layer that silently stops ingesting is worse than one that fails loudly. Track these metrics:

-- Last ingestion time and today's volume (no WHERE clause on today:
-- a fully stale table would otherwise return no rows at all)
SELECT
    MAX(_ingested_at) AS last_ingested,
    DATEDIFF(MINUTE, MAX(_ingested_at), CURRENT_TIMESTAMP()) AS minutes_since_last_write,
    COUNT_IF(DATE(_ingested_at) = CURRENT_DATE()) AS records_today
FROM catalog_prod.bronze.orders;

Set a Databricks SQL Alert on minutes_since_last_write > 60 to catch stale pipelines early.

What Bronze is NOT

  • Bronze does not deduplicate records
  • Bronze does not apply business rules
  • Bronze does not enforce types beyond what AutoLoader infers
  • Bronze does not JOIN with other tables

All of that is Silver's job. Keep Bronze simple — its reliability is its value.