datarani

What is Medallion Architecture?

Medallion Architecture is a data design pattern that organizes data into three progressive layers of quality and transformation. Each layer has a specific purpose and builds on top of the previous one.

The three layers

Bronze — raw ingestion

The Bronze layer is the landing zone for all raw data. Here we store data exactly as it arrives from the source systems, without any transformation. This includes:

CDC events from SQL Server
API responses in JSON format
CSV files from legacy systems

The key principle: never delete data from Bronze. This layer is your safety net. If something goes wrong downstream, you can always replay from here.

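# Bronze ingestion example (Databricks Auto Loader; `spark` is the notebook's SparkSession)
# Schema inference needs a schema location; the _schema path below is an assumed example.
df_bronze = spark.readStream \
    .format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.schemaLocation", "/mnt/bronze/orders/_schema") \
    .load("/mnt/landing/orders/")

# Append-only write; the checkpoint records which input has already been ingested.
df_bronze.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/mnt/bronze/orders/_checkpoint") \
    .outputMode("append") \
    .toTable("bronze.orders")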
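
To make the "never delete, always replay" principle concrete, here is a minimal plain-Python sketch with no Spark involved. `BronzeLog`, `BronzeRecord`, and the `orders_api` source label are hypothetical names for illustration only. The sketch assumes Bronze keeps each payload as an untouched string plus ingestion metadata, and that parsing happens downstream.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterator, List


@dataclass(frozen=True)
class BronzeRecord:
    """One raw record plus ingestion metadata. The payload is never parsed here."""
    raw: str          # payload exactly as received
    source: str       # hypothetical source label
    ingested_at: str  # UTC ingestion timestamp, ISO 8601


class BronzeLog:
    """Hypothetical in-memory stand-in for an append-only Bronze table."""

    def __init__(self) -> None:
        self._records: List[BronzeRecord] = []

    def append(self, raw: str, source: str) -> None:
        # Store the payload untouched, even if it is malformed.
        stamp = datetime.now(timezone.utc).isoformat()
        self._records.append(BronzeRecord(raw, source, stamp))

    def replay(self) -> Iterator[BronzeRecord]:
        # Downstream layers re-read everything in arrival order.
        return iter(self._records)

    def __len__(self) -> int:
        return len(self._records)


def parse_orders(log: BronzeLog) -> list:
    """Downstream step: malformed payloads are skipped here, not removed from Bronze."""
    parsed = []
    for record in log.replay():
        try:
            parsed.append(json.loads(record.raw))
        except json.JSONDecodeError:
            continue
    return parsed


bronze = BronzeLog()
bronze.append('{"order_id": 1, "revenue": "19.90"}', source="orders_api")
bronze.append('{"order_id": 2, "revenue": ', source="orders_api")  # truncated payload
orders = parse_orders(bronze)
```

The class deliberately has no delete or update method. The truncated payload stays in Bronze, so a fixed parser can pick it up on a later replay.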

Silver — cleaned and conformed

The Silver layer applies business rules, deduplication, and schema enforcement. This is where raw data becomes trusted data.

Key transformations at this layer:

Type casting and null handling
Deduplication using MERGE
Join with reference tables
Business rule validation
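
As a rough illustration of the first and last items in that list, the sketch below cleans a single order in plain Python. The field names, the "unknown" default, and the "revenue cannot be negative" rule are assumptions made for the example, not rules from our pipeline. Joins with reference tables are left out.

```python
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import Optional


def clean_order(raw: dict) -> Optional[dict]:
    """Return a typed, validated order, or None if the row must be rejected."""
    # Null handling: order_id and revenue are required in this sketch.
    if raw.get("order_id") is None or raw.get("revenue") is None:
        return None
    try:
        # Type casting: Bronze values arrive as strings.
        order = {
            "order_id": int(raw["order_id"]),
            "revenue": Decimal(str(raw["revenue"])),
            "order_date": date.fromisoformat(raw["order_date"]),
            # Null handling: an optional field gets an explicit default.
            "product_category": raw.get("product_category") or "unknown",
        }
    except (KeyError, TypeError, ValueError, InvalidOperation):
        return None
    # Business rule validation (assumed rule): revenue cannot be negative.
    if order["revenue"] < 0:
        return None
    return order


rows = [
    {"order_id": "1", "revenue": "19.90", "order_date": "2024-03-05", "product_category": "books"},
    {"order_id": "2", "revenue": "-5.00", "order_date": "2024-03-06", "product_category": None},
    {"order_id": None, "revenue": "7.50", "order_date": "2024-03-07"},
    {"order_id": "4", "revenue": "12.00", "order_date": "2024-03-08", "product_category": None},
]
cleaned = [order for order in map(clean_order, rows) if order is not None]
```

Only orders 1 and 4 survive. Order 2 breaks the assumed business rule and the third row has no key. In a real pipeline the rejected rows would more likely go to a quarantine table than be dropped silently.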

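# Silver MERGE example: upsert staged orders into Silver, keyed on order_id.
# The source should hold at most one row per order_id; Delta Lake rejects a
# MERGE in which several source rows match the same target row.
spark.sql("""
  MERGE INTO silver.orders AS target
  USING bronze.orders_staging AS source
  ON target.order_id = source.order_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")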

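For readers who want to see the upsert semantics without a cluster, here is a small plain-Python sketch of the MERGE above. It assumes the staging rows carry an `updated_at` column, which is a hypothetical name, so that the source can be reduced to one row per `order_id` before merging.

```python
from typing import Dict, Iterable, List


def dedupe_latest(rows: Iterable[dict]) -> List[dict]:
    """Keep one row per order_id: the one with the highest updated_at."""
    latest: Dict[int, dict] = {}
    for row in rows:
        current = latest.get(row["order_id"])
        if current is None or row["updated_at"] > current["updated_at"]:
            latest[row["order_id"]] = row
    return list(latest.values())


def merge_into(target: Dict[int, dict], source: Iterable[dict]) -> Dict[int, dict]:
    """Upsert keyed on order_id: matched rows are replaced, new rows are inserted."""
    merged = dict(target)
    for row in source:
        # One assignment covers both WHEN MATCHED and WHEN NOT MATCHED.
        merged[row["order_id"]] = row
    return merged


silver = {1: {"order_id": 1, "status": "new", "updated_at": "2024-03-05T10:00:00"}}
staging = [
    {"order_id": 1, "status": "paid", "updated_at": "2024-03-05T11:00:00"},
    {"order_id": 1, "status": "shipped", "updated_at": "2024-03-05T12:00:00"},
    {"order_id": 2, "status": "new", "updated_at": "2024-03-05T12:30:00"},
]
silver = merge_into(silver, dedupe_latest(staging))
```

Running the same merge a second time leaves `silver` unchanged. That idempotency is what makes it safe to retry a failed batch.
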
Gold — business aggregations

The Gold layer contains business-ready aggregations optimized for consumption by BI tools, dashboards, and ML models.

# Gold aggregation example: monthly revenue and unique customers per product category
df_gold = spark.sql("""
  SELECT
    date_trunc('month', order_date) AS month,
    product_category,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
  FROM silver.orders
  GROUP BY 1, 2
""")

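To show what that query computes, here is the same aggregation as a plain-Python sketch over a handful of made-up Silver rows. The sample values are invented for illustration, and truncating to the month is approximated by setting the day to 1.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal

# Invented sample rows standing in for silver.orders.
silver_orders = [
    {"order_date": date(2024, 3, 5), "product_category": "books",
     "revenue": Decimal("19.90"), "customer_id": "c1"},
    {"order_date": date(2024, 3, 20), "product_category": "books",
     "revenue": Decimal("5.10"), "customer_id": "c1"},
    {"order_date": date(2024, 3, 21), "product_category": "books",
     "revenue": Decimal("10.00"), "customer_id": "c2"},
    {"order_date": date(2024, 4, 2), "product_category": "games",
     "revenue": Decimal("40.00"), "customer_id": "c1"},
]


def monthly_revenue(orders):
    """Group by (month, category); sum revenue and count distinct customers."""
    revenue = defaultdict(Decimal)   # Decimal() starts each group at 0
    customers = defaultdict(set)     # a set gives COUNT(DISTINCT ...) semantics
    for order in orders:
        key = (order["order_date"].replace(day=1), order["product_category"])
        revenue[key] += order["revenue"]
        customers[key].add(order["customer_id"])
    return [
        {
            "month": month,
            "product_category": category,
            "total_revenue": revenue[(month, category)],
            "unique_customers": len(customers[(month, category)]),
        }
        for (month, category) in sorted(revenue)
    ]


gold_rows = monthly_revenue(silver_orders)
```

Customer `c1` ordered books twice in March but counts once in `unique_customers`, while both orders count toward `total_revenue`.
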
Why this architecture works

After implementing Medallion Architecture in production, we saw three concrete improvements. First, traceability: when a number is wrong in Gold, we trace it back through Silver to Bronze, and that alone saves hours of debugging. Second, safe reprocessing: we can drop and recreate Silver without losing source data, because Bronze still holds every raw record. Third, team independence: data engineers own Bronze and Silver, while analysts own Gold.
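
The tracing and reprocessing ideas can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not our production code. `bronze_id` is a hypothetical lineage column, and Silver is assumed to be a pure function of Bronze in which the last record per order wins.

```python
from typing import Dict, List

# Invented Bronze rows; bronze_id stands in for whatever uniquely identifies a raw record.
bronze = [
    {"bronze_id": "b1", "raw": {"order_id": "1", "revenue": "20"}},
    {"bronze_id": "b2", "raw": {"order_id": "1", "revenue": "25"}},  # later correction
    {"bronze_id": "b3", "raw": {"order_id": "2", "revenue": "10"}},
]


def build_silver(bronze_rows: List[dict]) -> Dict[int, dict]:
    """Pure function of Bronze: last record per order wins, lineage kept in bronze_id."""
    silver: Dict[int, dict] = {}
    for row in bronze_rows:
        order_id = int(row["raw"]["order_id"])
        silver[order_id] = {
            "order_id": order_id,
            "revenue": int(row["raw"]["revenue"]),
            "bronze_id": row["bronze_id"],
        }
    return silver


def build_gold(silver: Dict[int, dict]) -> dict:
    """One aggregate plus the Silver keys that produced it."""
    return {
        "total_revenue": sum(row["revenue"] for row in silver.values()),
        "order_ids": sorted(silver),
    }


def trace(gold: dict, silver: Dict[int, dict]) -> List[str]:
    """Walk a Gold number back to the Bronze records behind it."""
    return [silver[order_id]["bronze_id"] for order_id in gold["order_ids"]]


silver_v1 = build_silver(bronze)
gold = build_gold(silver_v1)
sources = trace(gold, silver_v1)

# Reprocessing: throw Silver away and rebuild it from Bronze alone.
silver_v2 = build_silver(bronze)
```

Because `build_silver` reads nothing but Bronze, the rebuilt table equals the original, and `sources` names the raw records to inspect when the Gold total looks wrong.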

Lessons learned

The biggest mistake we made early on was trying to do too much in a single layer. We had pipelines jumping from raw JSON directly to business aggregations, which made debugging a nightmare.

Separating concerns into distinct layers, even when it feels like extra work, pays dividends in production stability.