The single most common performance problem I see in Delta Lake tables is bad partitioning. Either too many partitions (small files problem) or too few (full table scans on filtered queries). Both are easy to create and surprisingly expensive to fix once a table is in production.
This article covers how to think about partitioning correctly before writing your first row, and what to do when you need to fix a table that was designed wrong.
What partitioning does
When you partition a Delta table, Databricks stores rows with the same partition column value in the same directory on ADLS. A query with a filter on the partition column only reads those directories — Databricks never opens files in other partitions.
# Without partitioning: every query reads all files
df.write.format("delta").saveAsTable("silver.events")
# With date partitioning: queries filtered by date read only relevant files
df.write \
.format("delta") \
.partitionBy("event_date") \
.saveAsTable("silver.events")
For a table with 3 years of events, a query WHERE event_date = '2025-03-01' on a date-partitioned table reads 1/1095 of the data. On an unpartitioned table, it reads everything.
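You can confirm pruning is happening by inspecting the physical plan, which lists the pushed-down partition predicate. A minimal sketch, with a helper that just restates the 1/1095 arithmetic (the function names are illustrative, not an API):

```python
def pruned_read(spark, day: str):
    """Read one day from the date-partitioned table. Only that day's
    directory is scanned; df.explain() shows a PartitionFilters entry
    containing the event_date predicate."""
    return spark.table("silver.events").where(f"event_date = '{day}'")

def scanned_fraction(total_days: int, selected_days: int) -> float:
    """Fraction of a daily-partitioned table a date filter touches."""
    return selected_days / total_days
```

`scanned_fraction(1095, 1)` reproduces the 1/1095 figure for three years of daily partitions.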
The small files trap
The most common partitioning mistake is over-partitioning — splitting data into so many partitions that each holds only a handful of small files (under ~100MB each).
Every Parquet file has fixed overhead: metadata reads, file open operations, S3/ADLS API calls. When a query needs to read 10,000 files of 5MB each, that overhead dominates execution time. The same data in 100 files of 500MB each is dramatically faster.
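The effect is easy to see with a toy cost model. The per-file overhead and throughput constants below are illustrative guesses, not measured numbers — the point is the shape of the formula, not the values:

```python
def scan_seconds(num_files: int, file_mb: float,
                 per_file_overhead_s: float = 0.05,
                 throughput_mb_s: float = 500.0) -> float:
    """Toy cost model: a fixed per-file cost (metadata read, file open,
    object-store API call) plus the time to stream the bytes."""
    fixed = num_files * per_file_overhead_s
    streaming = num_files * file_mb / throughput_mb_s
    return fixed + streaming

# Same 50GB of data, very different scan times:
many_small = scan_seconds(10_000, 5)   # overhead-dominated, ~600s
few_large = scan_seconds(100, 500)     # throughput-dominated, ~105s
```

With these made-up constants, the small-file layout spends five sixths of its time on per-file overhead rather than reading data.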
How over-partitioning happens:
# BAD: partitioning by a high-cardinality column
# If you have 1M customers, this creates 1M partition directories
df.write.partitionBy("customer_id").saveAsTable("silver.customers")
# BAD: partitioning by timestamp (each unique second = one partition)
df.write.partitionBy("event_timestamp").saveAsTable("silver.events")
# GOOD: partition by date (reasonable cardinality)
df.withColumn("event_date", F.to_date("event_timestamp")) \
.write.partitionBy("event_date").saveAsTable("silver.events")
Rule of thumb: Each partition should contain at least 128MB of data. If your daily partition is 5MB, you're over-partitioned — consider weekly or monthly partitioning, or no partitioning at all.
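One way to check where an existing table stands is DESCRIBE DETAIL, which reports numFiles and sizeInBytes for a Delta table. A sketch applying the 128MB rule of thumb (helper names are mine, not an API):

```python
def avg_file_mb(size_bytes: int, num_files: int) -> float:
    """Average Parquet file size in MB."""
    return size_bytes / max(num_files, 1) / (1024 * 1024)

def is_over_partitioned(size_bytes: int, num_files: int,
                        threshold_mb: float = 128.0) -> bool:
    """Flag tables whose average file falls below the ~128MB rule of thumb."""
    return avg_file_mb(size_bytes, num_files) < threshold_mb

def check_table(spark, table: str) -> bool:
    # DESCRIBE DETAIL on a Delta table returns numFiles and sizeInBytes
    d = spark.sql(f"DESCRIBE DETAIL {table}").collect()[0]
    return is_over_partitioned(d["sizeInBytes"], d["numFiles"])
```

Average file size is a blunt metric — a table can hide a few huge files among thousands of tiny ones — but it catches the common case cheaply.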
Z-Ordering: the alternative for high-cardinality columns
For columns you filter on frequently but that have too many distinct values to partition by (customer_id, product_id, user_id), use Z-Ordering instead of partitioning:
spark.sql("""
OPTIMIZE silver.events
ZORDER BY (customer_id, event_type)
""")
Z-Ordering co-locates rows with similar values in the same files using a space-filling curve. It doesn't partition the data — all files remain in a single directory — but Delta Lake's data skipping can skip large portions of the table when you filter on Z-ordered columns.
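The classic space-filling curve behind this is the Morton (Z-order) curve, built by interleaving the bits of the clustered columns. This two-column sketch is purely to illustrate the idea — Delta's actual implementation maps column values to range IDs before interleaving rather than using raw bits:

```python
def morton_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into one Z-order key. Sorting rows
    by this key keeps rows with similar (x, y) close together, so each
    file ends up covering a small range of BOTH columns."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x occupies even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y occupies odd bit positions
    return z
```

Sorting by `morton_key` instead of by one column alone gives every file tight min/max ranges on both columns, which is exactly what file-level data skipping needs.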
Z-Ordering works best when:
- The column has high cardinality (thousands to millions of distinct values)
- Queries frequently filter on this column in combination with partition filters
- Files are large enough (100MB+) that skipping whole files via their min/max statistics saves meaningful I/O
Combining partition and Z-Order is the most powerful pattern:
# Partition by date (coarse filter), Z-Order by customer_id (fine filter)
spark.sql("""
OPTIMIZE silver.events
WHERE event_date >= '2025-01-01'
ZORDER BY (customer_id, event_type)
""")
A query like WHERE event_date = '2025-03-01' AND customer_id = 12345 first skips 99.9% of the table via partition pruning, then skips most of the remaining files via Z-Order data skipping.
Liquid Clustering: the modern alternative
Databricks introduced Liquid Clustering as a replacement for static partitioning. Instead of pre-declaring partition columns at table creation, you define clustering columns and Databricks manages file layout automatically.
# Create table with liquid clustering
spark.sql("""
CREATE TABLE silver.events (
event_id BIGINT,
customer_id BIGINT,
event_type STRING,
event_date DATE,
payload STRING
)
USING DELTA
CLUSTER BY (event_date, customer_id)
""")
The key advantage: you can change clustering columns without rewriting the table.
# Change clustering columns on an existing table
spark.sql("ALTER TABLE silver.events CLUSTER BY (customer_id, event_type)")
With static partitioning, changing the partition column requires a full table rewrite. With Liquid Clustering, the next OPTIMIZE run applies the new clustering incrementally.
Prefer Liquid Clustering over static partitioning when your query patterns change over time, when you're creating new tables and want operational flexibility, or when your partition column has irregular distribution (some dates have 10x more data than others).
How to fix a badly partitioned table
You have a production table partitioned wrong. Here are your options, from least to most disruptive.
Option 1 — Add Z-Ordering (non-destructive). If the table has no partitioning, or is partitioned on the right column but queries also filter on a high-cardinality column, Z-Ordering can compensate without changing the table structure:
spark.sql("OPTIMIZE silver.events ZORDER BY (customer_id)")
Option 2 — Change to Liquid Clustering. If you're on Databricks Runtime 13.3+, you can migrate an unpartitioned table to Liquid Clustering in place; a table with existing partition columns cannot be converted directly and still needs the rewrite in Option 3:
spark.sql("ALTER TABLE silver.events CLUSTER BY (event_date, customer_id)")
# Next OPTIMIZE will start applying the new clustering
spark.sql("OPTIMIZE silver.events")
Option 3 — Full rewrite (most disruptive but cleanest result). Create a new table with the correct partitioning, backfill it, and swap:
# Create new table with correct partitioning
spark.sql("""
CREATE TABLE silver.events_v2
USING DELTA
PARTITIONED BY (event_date)
LOCATION 'abfss://silver@storage.dfs.core.windows.net/events_v2'
AS SELECT *, TO_DATE(event_timestamp) AS event_date
FROM silver.events
""")
# After validation, drop old and rename
spark.sql("DROP TABLE silver.events")
spark.sql("ALTER TABLE silver.events_v2 RENAME TO silver.events")
The rewrite requires downtime or a cutover window. Optimize it by writing the historical data first, then processing only the delta since the backfill started.
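A sketch of that two-phase backfill, assuming an event_timestamp column, an empty silver.events_v2 already created with the new partitioning, and a hypothetical cutoff timestamp:

```python
def split_predicates(cutoff_ts: str):
    """Split the source at a fixed timestamp: everything before the
    cutoff is the bulk historical load; everything at or after it is
    the small catch-up job run during the cutover window."""
    return (f"event_timestamp <  '{cutoff_ts}'",
            f"event_timestamp >= '{cutoff_ts}'")

def backfill(spark, cutoff_ts: str):
    historical, catch_up = split_predicates(cutoff_ts)
    # Phase 1: bulk-copy history while silver.events stays live
    spark.sql(f"""
        INSERT INTO silver.events_v2
        SELECT *, TO_DATE(event_timestamp) AS event_date
        FROM silver.events WHERE {historical}
    """)
    # Phase 2: during the cutover window, copy only the recent rows —
    # a much smaller, faster job than the full history
    spark.sql(f"""
        INSERT INTO silver.events_v2
        SELECT *, TO_DATE(event_timestamp) AS event_date
        FROM silver.events WHERE {catch_up}
    """)
```

Splitting on ingestion timestamp only works cleanly if rows are never updated after the cutoff; with upserts, phase 2 should be a MERGE instead of an INSERT.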
A decision framework
When deciding on partitioning strategy for a new table, answer these questions in order:
1. What columns do queries filter on most frequently? Those are your partition/cluster candidates.
2. What is the cardinality of each candidate column?
   - Low cardinality (fewer than 1000 distinct values, e.g., status, region, date): good for static partitioning
   - High cardinality (more than 10,000 distinct values, e.g., customer_id): use Z-Order, not partitioning
3. How much data is in each partition?
   - Target: 100MB to 1GB per partition file
   - Less than 100MB per partition: too granular, consider coarser granularity or no partitioning
   - More than 10GB per partition: fine for batch, might be slow for point queries — add Z-Ordering
4. Will query patterns change?
   - Yes: prefer Liquid Clustering
   - No: static partitioning is simpler and sufficient
5. Are write patterns append-only or do you have frequent upserts?
   - Frequent upserts into specific partitions cause data skew — Liquid Clustering handles this better
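The framework above can be condensed into a small decision helper. The thresholds mirror the article's rules of thumb and should be treated as starting points, not hard limits:

```python
def choose_layout(distinct_values: int, partition_mb: float,
                  patterns_may_change: bool, frequent_upserts: bool) -> str:
    """Condensed decision framework for a new table's layout."""
    if distinct_values > 10_000:
        # too many distinct values to partition by
        return "z-order or liquid clustering, not partitioning"
    if patterns_may_change or frequent_upserts:
        # flexibility and skew handling favor liquid clustering
        return "liquid clustering"
    if partition_mb < 100:
        # partitions too small to justify the file overhead
        return "coarser partitions or no partitioning"
    return "static partitioning"
```

For example, a stable, append-only table with daily partitions of several hundred MB lands on static partitioning; swap in a customer_id-sized cardinality and it lands on Z-Order or Liquid Clustering instead.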
Following this framework doesn't guarantee perfect partitioning on the first try. But it reduces the chance of needing a painful full table rewrite six months into production.