Architecture · 2026-04-02 · 6 min read

How to set up your Gold layer in Databricks

Build a Gold layer that BI tools and analysts actually trust — using Star Schema, avoiding OBT traps, and modeling data for consumption, not just storage.

What is the Gold layer?

The Gold layer is the consumption layer of your Lakehouse. It contains business-ready data — aggregations, dimensional models, and metrics — optimized for BI tools, dashboards, and analyst queries.

Gold is where data engineering meets data modeling. Getting it right has an outsized impact on how much analysts trust your Lakehouse.

My recommendation: use Star Schema

After building Gold layers across several production Lakehouses, my strong recommendation is to model Gold using Star Schema.

Star Schema organizes data into fact tables (measurable events like orders, transactions, page views) and dimension tables (descriptive context like customers, products, regions, and dates).

                    ┌─────────────────┐
                    │  dim_customer   │
                    │  customer_id PK │
                    └────────┬────────┘
                             │
┌───────────────┐   ┌────────┴─────────┐   ┌──────────────┐
│  dim_product  │   │   fact_orders    │   │   dim_date   │
│ product_id PK ├───┤  order_id PK     ├───┤  date_id PK  │
└───────────────┘   │  customer_id FK  │   └──────────────┘
                    │  product_id FK   │
                    │  date_id FK      │
                    │  quantity        │
                    │  revenue         │
                    │  discount        │
                    └──────────────────┘

Why Star Schema at Gold?

Query performance is the first reason. BI tools like Power BI and Tableau generate simple JOINs against Star Schema, and Databricks SQL optimizes these well with statistics and broadcast joins.

Analyst familiarity matters too. Most analysts understand dimensional modeling. A well-named fact table with clear foreign keys is largely self-documenting. New metrics can be added to fact tables without touching dimensions, and new slicing attributes only require dimension changes. And forcing yourself to define the grain of a fact table early prevents modeling mistakes that are expensive to fix later.
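Defining the grain is cheap to enforce once you've written it down. A minimal sketch of a post-load guard — `assert_grain` and its arguments are illustrative, not part of any Databricks API:

```python
def assert_grain(spark, table, keys):
    """Fail the pipeline if `keys` do not uniquely identify rows in `table`
    (i.e. the declared grain is violated)."""
    dupes = (
        spark.table(table)
        .groupBy(*keys)          # one group per declared key combination
        .count()
        .filter("count > 1")     # key combinations that appear more than once
        .count()
    )
    assert dupes == 0, f"{table}: grain violated on {keys} ({dupes} duplicate key sets)"

# Example: fact_orders is declared as "one row per order"
# assert_grain(spark, "catalog_prod.gold.fact_orders", ["order_id"])
```

Running a check like this after every load turns the grain from a design note into an invariant.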

Avoid OBT — unless you truly need it

OBT (One Big Table) is the pattern where you pre-join everything — facts and all dimensions — into a single denormalized table.

OBT is tempting because it simplifies queries: SELECT SUM(revenue) FROM obt WHERE region = 'Brazil'. No JOINs needed.

But OBT has real costs in production. Dimension values are repeated for every fact row — a customer name stored 10 million times instead of once. If a dimension attribute changes (customer renamed, product recategorized), you need to rewrite every affected row in the OBT. Adding a new dimension attribute means altering a massive table, rewriting it, and invalidating caches. Full JOINs at write time also concentrate data on a few partitions, creating skew.

OBT is acceptable when the dataset is small (under 10M rows) and rarely updated, when you're feeding a tool that cannot JOIN, when a strict latency SLA can only be met by pre-joining, or when you need a read-optimized snapshot for a single well-defined use case. In those situations, an OBT can coexist alongside your Star Schema — built from it — without replacing it.
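When one of those exceptions applies, the OBT should be derived from the Star Schema rather than rebuilt from Silver. A sketch reusing the table names from this article — the `obt_orders_snapshot` target name is hypothetical:

```python
def build_obt_orders(spark):
    """Pre-join fact_orders with its dimensions into a read-only snapshot
    for one narrow use case, leaving the Star Schema as the source of truth."""
    (
        spark.table("catalog_prod.gold.fact_orders").alias("f")
        .join(spark.table("catalog_prod.gold.dim_customer").alias("c"),
              "customer_id", "left")
        .join(spark.table("catalog_prod.gold.dim_date").alias("d"),
              "date_id", "left")
        .select("f.order_id", "f.revenue", "f.discount",
                "c.country", "c.segment", "d.year", "d.month")
        .write.format("delta").mode("overwrite")
        .saveAsTable("catalog_prod.gold.obt_orders_snapshot")
    )
```

Because the snapshot is built from Gold, a dimension fix only requires re-running this one job, not reprocessing Silver.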

Building the dimension tables

Dimensions come from Silver's conformed entities. They are slowly-changing by nature — a customer's email changes, a product's category changes.

# dim_customer — current state (SCD Type 1)
from pyspark.sql import functions as F

df_dim_customer = (
    spark.table("catalog_prod.silver.customers")
    .filter("is_deleted = false")
    .select(
        F.col("customer_id"),
        F.col("full_name"),
        F.col("email"),
        F.col("country"),
        F.col("segment"),
        F.col("created_at"),
        F.current_timestamp().alias("_refreshed_at"),
    )
)

(
    df_dim_customer.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("catalog_prod.gold.dim_customer")
)

For dimensions that need history (SCD Type 2), keep the versioned history in Silver and project only the current snapshot into Gold. Gold should always expose the current view — analytical history belongs in Silver.
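That projection is a one-liner if Silver tracks versions with a current-row flag. A sketch — the `is_current` / `valid_from` / `valid_to` column names are assumptions about your Silver SCD2 layout:

```python
def project_current_dim(spark, silver_table, gold_table):
    """Project the current-version rows of a Silver SCD Type 2 table
    into a Gold dimension, dropping the versioning columns."""
    (
        spark.table(silver_table)
        .filter("is_current = true")               # latest version of each key only
        .drop("valid_from", "valid_to", "is_current")
        .write.format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .saveAsTable(gold_table)
    )

# project_current_dim(spark, "catalog_prod.silver.products_scd2",
#                     "catalog_prod.gold.dim_product")
```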

# dim_date — static, generate once
from pyspark.sql import functions as F
from pyspark.sql.functions import sequence, explode, to_date, expr

df_dim_date = (
    spark.range(1)
    .select(explode(sequence(
        to_date(F.lit("2020-01-01")),
        to_date(F.lit("2030-12-31")),
        expr("interval 1 day")
    )).alias("date"))
    .select(
        F.date_format("date", "yyyyMMdd").cast("int").alias("date_id"),
        F.col("date").alias("full_date"),
        F.year("date").alias("year"),
        F.quarter("date").alias("quarter"),
        F.month("date").alias("month"),
        F.dayofmonth("date").alias("day"),
        F.date_format("date", "EEEE").alias("weekday_name"),
        F.dayofweek("date").alias("weekday_num"),
        (F.dayofweek("date").isin(1, 7)).alias("is_weekend"),
    )
)

(
    df_dim_date.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("catalog_prod.gold.dim_date")
)
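A cheap sanity check after generating the calendar: the row count must equal the number of days in the range, inclusive. The expected value is pure date arithmetic:

```python
from datetime import date

# The calendar spans 2020-01-01 .. 2030-12-31 inclusive
expected_rows = (date(2030, 12, 31) - date(2020, 1, 1)).days + 1

# With a Spark session available, verify the written table:
# assert spark.table("catalog_prod.gold.dim_date").count() == expected_rows
```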

Building the fact table

Facts are built by joining Silver events to the dimension tables on their keys:

df_orders = spark.table("catalog_prod.silver.orders").filter("is_deleted = false")
df_customers = spark.table("catalog_prod.gold.dim_customer")
df_products = spark.table("catalog_prod.gold.dim_product")
df_dates = spark.table("catalog_prod.gold.dim_date")

df_fact_orders = (
    df_orders
    .join(df_customers, on="customer_id", how="left")
    .join(df_products, on="product_id", how="left")
    .join(
        df_dates,
        on=(F.date_format(df_orders.order_date, "yyyyMMdd").cast("int") == df_dates.date_id),
        how="left"
    )
    .select(
        df_orders.order_id,
        df_customers.customer_id,
        df_products.product_id,
        df_dates.date_id,
        df_orders.quantity,
        df_orders.total_amount.alias("revenue"),
        df_orders.discount_amount.alias("discount"),
        df_orders.status,
        df_orders._updated_at,
    )
)

(
    df_fact_orders.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .partitionBy("date_id")
    .saveAsTable("catalog_prod.gold.fact_orders")
)
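Because these are left joins, an order whose customer, product, or date is missing from its dimension still lands in the fact table — just with a NULL key. Worth counting before publishing; a sketch against the table built above:

```python
def count_orphan_keys(spark, fact_table="catalog_prod.gold.fact_orders"):
    """Count fact rows whose dimension lookups failed (NULL foreign keys)."""
    from pyspark.sql import functions as F
    df = spark.table(fact_table)
    return df.agg(
        F.sum(F.col("customer_id").isNull().cast("int")).alias("missing_customer"),
        F.sum(F.col("product_id").isNull().cast("int")).alias("missing_product"),
        F.sum(F.col("date_id").isNull().cast("int")).alias("missing_date"),
    )
```

Non-zero counts usually mean the dimension load is lagging the fact load, or a key arrived in Silver that the dimension has never seen.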

Incremental refresh vs full overwrite

For large fact tables, full overwrite on every run is expensive. Use incremental refresh based on Silver's Change Data Feed:

# Read only changed rows from Silver since last Gold run (batch read —
# persist last_processed_version yourself, e.g. in a checkpoint table)
df_changes = (
    spark.read
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_processed_version)
    .table("catalog_prod.silver.orders")
    .filter(F.col("_change_type").isin("insert", "update_postimage"))
)
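The changed rows can then be upserted into the Gold fact with a Delta MERGE keyed on order_id. A sketch — `DeltaTable` comes from the `delta-spark` package that ships with the Databricks runtime, and `df_changes` is assumed to already have the fact table's column layout:

```python
def merge_changes(spark, df_changes):
    """Upsert changed Silver rows into the Gold fact table by order_id."""
    from delta.tables import DeltaTable  # bundled with delta-spark / Databricks

    fact = DeltaTable.forName(spark, "catalog_prod.gold.fact_orders")
    (
        fact.alias("t")
        .merge(df_changes.alias("s"), "t.order_id = s.order_id")
        .whenMatchedUpdateAll()      # changed orders overwrite the existing row
        .whenNotMatchedInsertAll()   # new orders are appended
        .execute()
    )
```

After a successful merge, record the Silver table version you read up to, so the next run's `startingVersion` picks up where this one left off.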

Table properties for Gold

Gold tables are read-heavy. Optimize for scan performance:

ALTER TABLE catalog_prod.gold.fact_orders
SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',
    'delta.autoOptimize.autoCompact'   = 'true',
    'delta.enableChangeDataFeed'       = 'false',  -- Gold is a sink
    'delta.dataSkippingNumIndexedCols' = '8'       -- index key columns
);

-- Run OPTIMIZE + ZORDER on the most filtered columns
OPTIMIZE catalog_prod.gold.fact_orders
ZORDER BY (customer_id, date_id);

Exposing Gold to BI tools

In Databricks SQL, create a SQL warehouse and point Power BI or Tableau directly at catalog_prod.gold.*. Set the catalog as the default to simplify connection strings.

-- Recommended: create views for common aggregations
CREATE OR REPLACE VIEW catalog_prod.gold.vw_revenue_monthly AS
SELECT
    d.year,
    d.month,
    c.country,
    c.segment,
    SUM(f.revenue)               AS total_revenue,
    SUM(f.discount)              AS total_discount,
    COUNT(DISTINCT f.order_id)   AS order_count,
    COUNT(DISTINCT f.customer_id) AS unique_customers
FROM catalog_prod.gold.fact_orders f
JOIN catalog_prod.gold.dim_date     d ON f.date_id = d.date_id
JOIN catalog_prod.gold.dim_customer c ON f.customer_id = c.customer_id
GROUP BY d.year, d.month, c.country, c.segment;

Views are cheap and keep the underlying model flexible — you can refactor fact and dimension tables without breaking BI connections.

Monitoring Gold freshness

-- Alert if Gold hasn't been refreshed in the last 3 hours
SELECT
    'fact_orders' AS table_name,
    MAX(_updated_at)  AS last_refresh,
    DATEDIFF(MINUTE, MAX(_updated_at), CURRENT_TIMESTAMP()) AS minutes_stale
FROM catalog_prod.gold.fact_orders
HAVING minutes_stale > 180;

Summary

| Decision | Recommendation |
|---|---|
| Modeling approach | Star Schema |
| Grain definition | Define before writing any code |
| OBT | Only when justified by a specific constraint |
| Refresh strategy | Incremental via CDF for large facts, overwrite for dimensions |
| BI exposure | SQL views over Gold tables, not direct table access |
| Optimization | OPTIMIZE + ZORDER on query filter columns |

The Gold layer is the face of your Lakehouse. The time you invest in modeling it well pays back every time an analyst opens a dashboard and trusts what they see.