What is Medallion Architecture?
Medallion Architecture is a data design pattern that organizes data into three progressive layers of quality and transformation. Each layer has a specific purpose and builds on top of the previous one.
The Three Layers
Bronze — Raw Ingestion
The Bronze layer is the landing zone for all raw data. Here we store data exactly as it arrives from the source systems, without any transformation. This includes:
- CDC events from SQL Server
- API responses in JSON format
- CSV files from legacy systems
The key principle is never delete data from Bronze. This layer is your safety net — if something goes wrong downstream, you can always replay from here.
# Bronze ingestion example
df_bronze = spark.readStream \
.format("cloudFiles") \
.option("cloudFiles.format", "json") \
.load("/mnt/landing/orders/")
df_bronze.writeStream \
.format("delta") \
.option("checkpointLocation", "/mnt/bronze/orders/_checkpoint") \
.outputMode("append") \
.table("bronze.orders")
Silver — Cleaned and Conformed
The Silver layer applies business rules, deduplication, and schema enforcement. This is where raw data becomes trusted data.
Key transformations at this layer:
- Type casting and null handling
- Deduplication using MERGE
- Join with reference tables
- Business rule validation
# Silver MERGE example
spark.sql("""
MERGE INTO silver.orders AS target
USING bronze.orders_staging AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
Gold — Business Aggregations
The Gold layer contains business-ready aggregations optimized for consumption by BI tools, dashboards, and ML models.
# Gold aggregation example
df_gold = spark.sql("""
SELECT
date_trunc('month', order_date) as month,
product_category,
sum(revenue) as total_revenue,
count(distinct customer_id) as unique_customers
FROM silver.orders
GROUP BY 1, 2
""")
Why This Architecture Works
After implementing Medallion Architecture in production, we saw significant improvements:
- Debugging became easier — when a number is wrong in Gold, we trace back through Silver to Bronze
- Reprocessing is safe — we can drop and recreate Silver without losing source data
- Teams work independently — data engineers own Bronze/Silver, analysts own Gold
Lessons Learned
The biggest mistake we made early on was trying to do too much in a single layer. We had pipelines jumping from raw JSON directly to business aggregations, which made debugging a nightmare.
Separating concerns into three layers — even when it feels like extra work — pays dividends in production stability.