The bill that made everyone pay attention
Six months after the data platform team scaled from 3 to 11 engineers, the monthly Databricks invoice had tripled. Workloads had roughly doubled. Nobody could explain the gap.
That's the pattern I see in almost every mid-sized data organization: teams grow, pipelines multiply, clusters proliferate. Cost tracking stays at zero until finance sends a screenshot of the monthly bill. FinOps for data engineering means knowing where every DBU goes before someone else asks you.
This is the playbook we ran to reduce that bill by 60% over eight weeks, without touching a single SLA.
What FinOps actually means for data teams
FinOps is a practice, not a tool. The Cloud FinOps Foundation defines it as the intersection of finance, engineering, and business. For data teams in practice, that means three things: visibility into what you're spending and why, governance over who can spend what, and continuous optimization to reduce waste without reducing capability.
The main difference from regular cloud cost tracking is what you're actually measuring. In Databricks, you don't pay per API call or per GB stored. You pay in DBUs (Databricks Units), which vary by cluster type, instance family, and workload type — Jobs Compute, SQL Compute, and All-Purpose Compute each have different rates. A Standard_DS4_v2 running as a job cluster consumes DBUs at a different rate than the same machine running as an all-purpose cluster.
That's why System Tables come first.
1. Diagnosing cost drivers with System Tables
Start here before touching anything else.
Enabling the billing schema
-- Run as account admin
ALTER SYSTEM ENABLE SCHEMA system.billing;
ALTER SYSTEM ENABLE SCHEMA system.compute;
Finding your top cost drivers
-- Top 20 clusters by DBU consumption in the last 30 days
SELECT
c.cluster_name,
c.cluster_source, -- JOB, UI, API, PIPELINE
c.created_by,
SUM(u.usage_quantity) AS total_dbus,
SUM(u.usage_quantity * lp.default) AS estimated_cost_usd,
COUNT(DISTINCT u.usage_date) AS active_days,
SUM(u.usage_quantity) / COUNT(DISTINCT u.usage_date) AS avg_daily_dbus
FROM system.billing.usage u
JOIN system.billing.list_prices lp
ON u.sku_name = lp.sku_name AND u.cloud = lp.cloud
LEFT JOIN system.compute.clusters c
ON u.usage_metadata.cluster_id = c.cluster_id
WHERE u.usage_date >= CURRENT_DATE - INTERVAL 30 DAYS
GROUP BY 1, 2, 3
ORDER BY total_dbus DESC
LIMIT 20;
When we ran this query for the first time, 70% of DBU consumption came from all-purpose clusters left running between jobs. Nobody had set auto-termination.
Separating job clusters from all-purpose
-- Cost breakdown by cluster source
SELECT
COALESCE(c.cluster_source, 'UNKNOWN') AS cluster_type,
SUM(u.usage_quantity) AS total_dbus,
ROUND(SUM(u.usage_quantity * lp.default), 2) AS estimated_cost_usd,
ROUND(
100.0 * SUM(u.usage_quantity) /
SUM(SUM(u.usage_quantity)) OVER (),
1
) AS pct_of_total
FROM system.billing.usage u
JOIN system.billing.list_prices lp
ON u.sku_name = lp.sku_name AND u.cloud = lp.cloud
LEFT JOIN system.compute.clusters c
ON u.usage_metadata.cluster_id = c.cluster_id
WHERE u.usage_date >= CURRENT_DATE - INTERVAL 30 DAYS
GROUP BY 1
ORDER BY total_dbus DESC;
Identifying idle clusters
-- Clusters with long idle windows (no jobs, still running)
SELECT
cluster_id,
cluster_name,
created_by,
autotermination_minutes,
DATEDIFF(HOUR, last_activity_time, current_timestamp()) AS idle_hours
FROM system.compute.clusters
WHERE
state = 'RUNNING'
AND last_activity_time < CURRENT_TIMESTAMP - INTERVAL 2 HOURS
AND cluster_source = 'UI' -- interactive clusters
ORDER BY idle_hours DESC;
In our case, this query turned up 9 interactive clusters sitting idle for over 4 hours, most without auto-termination configured.
2. Cluster Policies — the most underrated lever
Cluster Policies let you define constraints that apply to all clusters in a workspace. Engineers can still create clusters, but only within what the policy allows. It requires no infrastructure changes, no migration, no code refactor — just a JSON config pushed to the workspace.
Policy for job clusters
This policy forces job clusters to be ephemeral, right-sized, and spot-backed:
{
"node_type_id": {
"type": "allowlist",
"values": ["Standard_DS3_v2", "Standard_DS4_v2", "Standard_D8s_v3"],
"defaultValue": "Standard_DS4_v2"
},
"driver_node_type_id": {
"type": "fixed",
"value": "Standard_DS3_v2"
},
"autoscale.min_workers": {
"type": "range",
"minValue": 1,
"maxValue": 4,
"defaultValue": 2
},
"autoscale.max_workers": {
"type": "range",
"minValue": 2,
"maxValue": 16,
"defaultValue": 8
},
"azure_attributes.availability": {
"type": "fixed",
"value": "SPOT_WITH_FALLBACK_AZURE"
},
"azure_attributes.spot_bid_max_price": {
"type": "fixed",
"value": -1
},
"cluster_log_conf.dbfs.destination": {
"type": "fixed",
"value": "dbfs:/cluster-logs"
},
"custom_tags.cost_center": {
"type": "regex",
"pattern": "^[a-z]{2,8}-[a-z]{2,8}$",
"isOptional": false
}
}
Don't make cost_center optional. Without it, chargeback is impossible — you'll never know which team ran what.
Policy for all-purpose clusters
Interactive clusters are where the most waste accumulates. They need tight auto-termination and instance limits:
{
"node_type_id": {
"type": "allowlist",
"values": ["Standard_DS3_v2", "Standard_DS4_v2"],
"defaultValue": "Standard_DS3_v2"
},
"num_workers": {
"type": "range",
"minValue": 0,
"maxValue": 4,
"defaultValue": 2
},
"autotermination_minutes": {
"type": "range",
"minValue": 10,
"maxValue": 60,
"defaultValue": 30
},
"azure_attributes.availability": {
"type": "fixed",
"value": "ON_DEMAND_AZURE"
},
"custom_tags.owner": {
"type": "regex",
"pattern": "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$",
"isOptional": false
}
}
All-purpose clusters use on-demand intentionally. Interactive work is unpredictable, and spot interruptions are disruptive to engineers — they rarely save as much as they do on batch.
Creating policies via the SDK
from databricks.sdk import WorkspaceClient
import json
w = WorkspaceClient()
with open("policies/job-cluster-policy.json") as f:
policy_definition = json.load(f)
policy = w.cluster_policies.create(
name="Job Clusters — Standard",
definition=json.dumps(policy_definition),
description="Enforces spot instances, auto-scaling, and cost tagging for all job clusters"
)
print(f"Policy created: {policy.policy_id}")
After rolling out these two policies, we saw a 25% cost reduction in the first two weeks, purely from auto-termination of idle clusters and forced tagging.
3. Spot Instances and Autoscaling — getting the combination right
Spot instances can cut compute costs 60–80% on batch workloads, but the savings only hold if you pick the right workloads.
When to use spot (and when not to)
| Workload type | Spot? | Reasoning | |---|---|---| | Batch ETL (idempotent MERGE) | Yes | On interruption, job restarts from checkpoint | | Delta Live Tables | Yes | DLT manages recovery internally | | Structured Streaming | No | Interruption causes processing lag, potential duplicate handling | | Interactive exploration | No | User loses work and context | | ML training (non-checkpoint) | No | Job restarts from zero | | ML training (with checkpoints) | Yes | MLflow or custom checkpointing handles recovery |
SPOT_WITH_FALLBACK vs SPOT
SPOT_WITH_FALLBACK_AZURE (or SPOT_WITH_FALLBACK on AWS) is almost always the right choice for job clusters. Pure spot fails the job if spot capacity is unavailable. With fallback, Azure automatically provisions on-demand instances if spot is exhausted in the region.
The cost difference between the two is negligible over a month — spot interruptions rarely affect more than 5–10% of capacity. The reliability difference is not.
Autoscaling configuration that actually works
Databricks autoscaling scales up quickly and scales down slowly. That means you often have more workers than you need for the tail of a job. Two config changes help:
# Set via Spark config on the cluster or job
spark.conf.set("spark.databricks.aggressiveWindowDownS", "60") # Scale down after 60s idle (default: 300s)
spark.conf.set("spark.databricks.scaleUpConsecutiveSuccessTarget", "1") # Scale up after 1 consecutive need
You can also set these in the cluster configuration JSON:
{
"spark_conf": {
"spark.databricks.aggressiveWindowDownS": "60",
"spark.databricks.scaleUpConsecutiveSuccessTarget": "1"
}
}
For batch ETL jobs that process data in defined stages, aggressive scale-down (60–120s) saves 15–25% of compute versus defaults. For streaming or continuous jobs, keep the default.
Right-sizing your instances
The most common mistake is over-provisioning instance size. A Standard_DS5_v2 (8 vCPUs, 56 GB RAM) costs 2x a Standard_DS4_v2 (8 vCPUs, 28 GB RAM). If your job isn't memory-bound, you're paying for RAM you'll never use.
-- Check peak memory usage per job (requires Ganglia or Spark metrics)
SELECT
run_id,
job_id,
task_key,
MAX(metrics.worker_memory_used_gb) AS peak_memory_gb,
MAX(metrics.worker_memory_total_gb) AS allocated_memory_gb,
ROUND(
100.0 * MAX(metrics.worker_memory_used_gb) /
NULLIF(MAX(metrics.worker_memory_total_gb), 0),
1
) AS memory_utilization_pct
FROM system.lakeflow.task_runs
WHERE start_time >= CURRENT_TIMESTAMP - INTERVAL 14 DAYS
GROUP BY 1, 2, 3
HAVING memory_utilization_pct < 50 -- candidates for downsize
ORDER BY allocated_memory_gb DESC;
4. Job and Pipeline Optimization
Cluster configuration gets you most of the way there. Spark-level settings add another 10–20%.
Adaptive Query Execution
The default spark.sql.shuffle.partitions = 200 is wrong for almost every workload. A small job generating 200 shuffle files pays unnecessary overhead. A large job is under-partitioned. Enable AQE and let Databricks handle it:
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionSize", "64MB")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128MB")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
For Databricks Runtime 10.4+, AQE is enabled by default. If you're on an older runtime, upgrading alone can reduce job duration by 20–30%.
Delta Lake optimizations that reduce compute
File reads and shuffles are money. Delta has a few settings that cut both:
from delta.tables import DeltaTable
# Auto-optimize: writes smaller, better-organized files automatically
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
# Run OPTIMIZE on high-read tables to reduce files scanned per query
dt = DeltaTable.forName(spark, "catalog_prod.silver.orders")
dt.optimize().executeZOrderBy("customer_id", "order_date")
For tables frequently joined or filtered on the same columns, liquid clustering (DBR 13.3+) is more effective than ZORDER:
-- Convert to liquid clustering (replaces partition + ZORDER)
ALTER TABLE catalog_prod.silver.orders
CLUSTER BY (customer_id, order_date);
-- Databricks will rewrite incrementally during OPTIMIZE
OPTIMIZE catalog_prod.silver.orders;
Avoiding unnecessary recomputation
Reading the same data multiple times in a pipeline is one of the more expensive things you can do silently. Cache intermediate DataFrames when a transformation feeds multiple downstream steps:
from pyspark.storagelevel import StorageLevel
orders_enriched = (
spark.table("bronze.orders")
.join(customers, "customer_id", "left")
.filter("order_status != 'cancelled'")
.withColumn("order_value_usd", col("order_value") * col("fx_rate"))
)
# Cache at MEMORY_AND_DISK — safe for large datasets
orders_enriched.persist(StorageLevel.MEMORY_AND_DISK)
# Force materialization
orders_enriched.count()
# Now use in multiple downstream operations without re-reading
orders_by_region = orders_enriched.groupBy("region").agg(...)
orders_by_product = orders_enriched.groupBy("product_id").agg(...)
# Clean up after the pipeline stage is done
orders_enriched.unpersist()
5. Showback and Chargeback — cost governance at scale
Optimization without governance is a one-time win. The billing comes back the moment a new team onboards or a new pipeline goes to production.
Showback: visibility before accountability
Showback shows teams what they're spending without charging their budget directly. Start here, not with chargeback:
-- Weekly showback report by team (requires cost_center tag from policies)
WITH usage_tagged AS (
SELECT
u.usage_date,
u.usage_metadata.job_id,
u.usage_metadata.cluster_id,
u.usage_quantity AS dbus,
u.usage_quantity * lp.default AS cost_usd,
c.custom_tags['cost_center'] AS cost_center,
c.custom_tags['owner'] AS owner
FROM system.billing.usage u
JOIN system.billing.list_prices lp
ON u.sku_name = lp.sku_name AND u.cloud = lp.cloud
LEFT JOIN system.compute.clusters c
ON u.usage_metadata.cluster_id = c.cluster_id
WHERE u.usage_date >= CURRENT_DATE - INTERVAL 7 DAYS
)
SELECT
cost_center,
SUM(dbus) AS total_dbus,
ROUND(SUM(cost_usd), 2) AS total_cost_usd,
COUNT(DISTINCT cluster_id) AS distinct_clusters,
COUNT(DISTINCT job_id) AS distinct_jobs
FROM usage_tagged
WHERE cost_center IS NOT NULL
GROUP BY cost_center
ORDER BY total_cost_usd DESC;
Chargeback: hard accountability
Chargeback allocates actual cost to team budgets. Move to it only after showback has been running for at least 30 days and teams trust the numbers. The query structure is the same — the difference is in what you do with the output.
Cost anomaly detection
Wire this query to a Databricks SQL alert. It fires when yesterday's cost was more than 2 standard deviations above the 30-day average:
WITH daily_cost AS (
SELECT
usage_date,
SUM(usage_quantity * lp.default) AS cost_usd
FROM system.billing.usage u
JOIN system.billing.list_prices lp
ON u.sku_name = lp.sku_name AND u.cloud = lp.cloud
WHERE usage_date >= CURRENT_DATE - INTERVAL 31 DAYS
GROUP BY usage_date
),
stats AS (
SELECT
AVG(cost_usd) AS avg_cost,
STDDEV(cost_usd) AS stddev_cost
FROM daily_cost
WHERE usage_date < CURRENT_DATE -- exclude today
)
SELECT
d.usage_date,
d.cost_usd,
s.avg_cost,
ROUND((d.cost_usd - s.avg_cost) / NULLIF(s.stddev_cost, 0), 2) AS z_score
FROM daily_cost d
CROSS JOIN stats s
WHERE usage_date = CURRENT_DATE - INTERVAL 1 DAY
AND (d.cost_usd - s.avg_cost) / NULLIF(s.stddev_cost, 0) > 2.0;
Route the alert to Slack or email so the on-call engineer sees it the next morning.
The case: what we actually did, in order
| Action | Time to implement | Cost reduction | |---|---|---| | Enable System Tables + audit idle clusters | 1 day | — | | Set auto-termination on all existing clusters | 2 hours | -15% | | Roll out cluster policies (job + all-purpose) | 1 week | -25% | | Enable spot instances on all batch job clusters | 2 days | -18% | | Enable AQE + aggressive scale-down | 1 day | -8% | | Delta OPTIMIZE + liquid clustering on hot tables | 3 days | -6% | | Deploy showback dashboard + anomaly alerts | 1 week | Governance | | Total | ~8 weeks | -60% |
The first two actions took less than a day and cut 15% of spend. None of this touched pipeline logic, SLAs, or output data quality.
Decision framework: where to start
Five questions to answer before touching anything:
- Can you query
system.billing.usage? If not, enable System Tables first. Nothing else is actionable without visibility. - What percentage of your DBUs come from all-purpose clusters? If it's over 40%, auto-termination policies come first.
- Are your batch job clusters using spot instances? If not,
SPOT_WITH_FALLBACKis a zero-risk change that saves 60–80% on compute. - Do you have
cost_centertags on your clusters? If not, you can't do chargeback later. Add the tag requirement to policies now. - Is AQE enabled? On DBR 10.4+, it should be on by default. Verify with
spark.conf.get("spark.sql.adaptive.enabled").
The sequence matters more than the individual steps. Visibility before governance, governance before optimization. Teams that jump straight to Spark tuning without knowing where their money goes usually end up optimizing the wrong jobs.
Rules that held
- Enable System Tables on day one. They don't backfill historical data, so every day you wait is data you lose permanently.
- Require
cost_centertags in cluster policies before you have more than 5 engineers on the platform. - Never use pure SPOT for streaming pipelines. The recovery cost in processing lag exceeds the savings.
- Run showback for 30 days before moving to chargeback. Teams need to trust the numbers before being charged by them.
- Treat auto-termination as a default, not a setting that someone might configure. No interactive cluster should exist without it.