What is Medallion Architecture?
Medallion Architecture is a data design pattern that organizes data into three progressive layers of quality and transformation. Each layer has a specific purpose and builds on top of the previous one.
The three layers
Bronze — raw ingestion
The Bronze layer is the landing zone for all raw data. Here we store data exactly as it arrives from the source systems, without any transformation. This includes:
- CDC events from SQL Server
- API responses in JSON format
- CSV files from legacy systems
The key principle is never delete data from Bronze. This layer is your safety net. If something goes wrong downstream, you can always replay from here.
# Bronze ingestion example
df_bronze = spark.readStream \
.format("cloudFiles") \
.option("cloudFiles.format", "json") \
.load("/mnt/landing/orders/")
df_bronze.writeStream \
.format("delta") \
.option("checkpointLocation", "/mnt/bronze/orders/_checkpoint") \
.outputMode("append") \
.table("bronze.orders")
Silver — cleaned and conformed
The Silver layer applies business rules, deduplication, and schema enforcement. This is where raw data becomes trusted data.
Key transformations at this layer:
- Type casting and null handling
- Deduplication using MERGE
- Join with reference tables
- Business rule validation
# Silver MERGE example
spark.sql("""
MERGE INTO silver.orders AS target
USING bronze.orders_staging AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
Gold — business aggregations
The Gold layer contains business-ready aggregations optimized for consumption by BI tools, dashboards, and ML models.
# Gold aggregation example
df_gold = spark.sql("""
SELECT
date_trunc('month', order_date) as month,
product_category,
sum(revenue) as total_revenue,
count(distinct customer_id) as unique_customers
FROM silver.orders
GROUP BY 1, 2
""")
Why this architecture works
After implementing Medallion Architecture in production, we saw significant improvements. When a number is wrong in Gold, we trace it back through Silver to Bronze — that traceability alone saves hours of debugging. Reprocessing is safe because we can drop and recreate Silver without losing source data. And teams can work independently: data engineers own Bronze and Silver, analysts own Gold.
Lessons learned
The biggest mistake we made early on was trying to do too much in a single layer. We had pipelines jumping from raw JSON directly to business aggregations, which made debugging a nightmare.
Separating concerns into distinct layers, even when it feels like extra work, pays dividends in production stability.