Architecture·2025-03-01·2 min read

Medallion Architecture in practice: Bronze, Silver and Gold with Delta Lake

How we structured the layers in the Lakehouse and why this decision changed our CDC ingestion.

What is Medallion Architecture?

Medallion Architecture is a data design pattern that organizes data into three progressive layers of quality and transformation. Each layer has a specific purpose and builds on top of the previous one.

The Three Layers

Bronze — Raw Ingestion

The Bronze layer is the landing zone for all raw data. Here we store data exactly as it arrives from the source systems, without any transformation. This includes:

  • CDC events from SQL Server
  • API responses in JSON format
  • CSV files from legacy systems

The key principle is never delete data from Bronze. This layer is your safety net — if something goes wrong downstream, you can always replay from here.

# Bronze ingestion example
df_bronze = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .load("/mnt/landing/orders/")

df_bronze.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/mnt/bronze/orders/_checkpoint") \
    .outputMode("append") \
    .table("bronze.orders")

Silver — Cleaned and Conformed

The Silver layer applies business rules, deduplication, and schema enforcement. This is where raw data becomes trusted data.

Key transformations at this layer:

  • Type casting and null handling
  • Deduplication using MERGE
  • Join with reference tables
  • Business rule validation
# Silver MERGE example
spark.sql("""
  MERGE INTO silver.orders AS target
  USING bronze.orders_staging AS source
  ON target.order_id = source.order_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")

Gold — Business Aggregations

The Gold layer contains business-ready aggregations optimized for consumption by BI tools, dashboards, and ML models.

# Gold aggregation example
df_gold = spark.sql("""
  SELECT 
    date_trunc('month', order_date) as month,
    product_category,
    sum(revenue) as total_revenue,
    count(distinct customer_id) as unique_customers
  FROM silver.orders
  GROUP BY 1, 2
""")

Why This Architecture Works

After implementing Medallion Architecture in production, we saw significant improvements:

  • Debugging became easier — when a number is wrong in Gold, we trace back through Silver to Bronze
  • Reprocessing is safe — we can drop and recreate Silver without losing source data
  • Teams work independently — data engineers own Bronze/Silver, analysts own Gold

Lessons Learned

The biggest mistake we made early on was trying to do too much in a single layer. We had pipelines jumping from raw JSON directly to business aggregations, which made debugging a nightmare.

Separating concerns into three layers — even when it feels like extra work — pays dividends in production stability.