Databricks · 2026-04-02 · 8 min read

Databricks Observability: monitoring your Lakehouse with native tools

Complete guide to observability in Databricks using System Tables, Spark UI, Lakehouse Monitoring, audit logs and query profiling — without external tools.

Why observability matters in a Lakehouse

A pipeline that runs is not the same as a pipeline that runs well. Without observability you only notice problems after they've already caused damage — a job that took 3 hours instead of 30 minutes, a table silently drifting in schema, a cluster spending 80% of its time on shuffle.

Databricks provides a complete native stack for observability. No external APM tools needed to cover the fundamentals.

The observability stack in Databricks

┌─────────────────────────────────────────────────────────┐
│                    SYSTEM TABLES                         │
│  system.access  ·  system.billing  ·  system.compute    │
│  system.lakeflow  ·  system.storage                      │
├─────────────────────────────────────────────────────────┤
│  Spark UI  │  Query Profiler  │  Lakehouse Monitoring   │
├─────────────────────────────────────────────────────────┤
│         Cluster Logs  ·  Job Run History                │
└─────────────────────────────────────────────────────────┘

1. System Tables — the observability foundation

System Tables are Unity Catalog tables populated automatically by Databricks. They're the most powerful native observability resource, and the first thing you should enable in any new workspace.

Enabling System Tables

# System schemas are enabled through the Unity Catalog REST API, not SQL
# (run as a metastore admin; system.billing is enabled by default)
for schema in access compute lakeflow storage; do
  curl -X PUT \
    "https://<workspace-url>/api/2.0/unity-catalog/metastores/<metastore-id>/systemschemas/${schema}" \
    -H "Authorization: Bearer <token>"
done

system.access — audit logs

Every action in Unity Catalog generates a record here: who accessed what, when, and from where.

-- Who accessed the orders table in the last 7 days?
SELECT
  event_time,
  user_identity.email AS user,
  action_name,
  request_params.full_name AS table_name
FROM system.access.audit
WHERE
  action_name IN ('getTable', 'executeQuery')
  AND request_params.full_name LIKE '%silver.orders%'
  AND event_time >= CURRENT_TIMESTAMP - INTERVAL 7 DAYS
ORDER BY event_time DESC;
-- Top 10 users by number of accesses
SELECT
  user_identity.email AS user,
  COUNT(*) AS total_actions,
  COUNT(DISTINCT DATE(event_time)) AS active_days
FROM system.access.audit
WHERE event_time >= CURRENT_TIMESTAMP - INTERVAL 30 DAYS
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;

system.billing — cost per job/cluster

-- Cost per workspace in the last 30 days
SELECT
  u.workspace_id,
  SUM(u.usage_quantity) AS total_dbus,
  SUM(u.usage_quantity * p.pricing.default) AS estimated_cost_usd
FROM system.billing.usage u
LEFT JOIN system.billing.list_prices p
  ON u.sku_name = p.sku_name
  AND u.cloud = p.cloud
  AND u.usage_end_time >= p.price_start_time
  AND (p.price_end_time IS NULL OR u.usage_end_time < p.price_end_time)
WHERE u.usage_date >= CURRENT_DATE - INTERVAL 30 DAYS
GROUP BY u.workspace_id
ORDER BY total_dbus DESC;
-- Most expensive jobs (join system.lakeflow.jobs on job_id for job names)
SELECT
  usage_metadata.job_id,
  SUM(usage_quantity) AS total_dbus
FROM system.billing.usage
WHERE
  usage_metadata.job_id IS NOT NULL
  AND usage_date >= CURRENT_DATE - INTERVAL 7 DAYS
GROUP BY 1
ORDER BY total_dbus DESC
LIMIT 20;

system.compute — cluster history

-- Clusters created in the last 30 days with their autoscale limits
SELECT
  cluster_id,
  cluster_name,
  cluster_source,
  min_autoscale_workers,
  max_autoscale_workers,
  owned_by,
  create_time
FROM system.compute.clusters
WHERE create_time >= CURRENT_TIMESTAMP - INTERVAL 30 DAYS
ORDER BY create_time DESC;

system.lakeflow — pipeline runs (Delta Live Tables)

-- Average pipeline duration in the last 14 days
SELECT
  pipeline_name,
  AVG(DATEDIFF(SECOND, update_start_time, update_end_time)) AS avg_duration_sec,
  COUNT(*) AS total_runs,
  SUM(CASE WHEN update_state = 'FAILED' THEN 1 ELSE 0 END) AS failures
FROM system.lakeflow.pipeline_updates
WHERE update_start_time >= CURRENT_TIMESTAMP - INTERVAL 14 DAYS
GROUP BY pipeline_name
ORDER BY avg_duration_sec DESC;
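The failure count this query surfaces feeds a basic reliability metric. A plain-Python sketch of the rule (the function name and the 5% threshold are illustrative, not part of any Databricks API):

```python
def flaky_pipelines(runs_by_pipeline, max_failure_rate=0.05):
    """Return pipelines whose failure rate exceeds the threshold.

    runs_by_pipeline: {pipeline_name: list of run result states}
    """
    flagged = {}
    for name, states in runs_by_pipeline.items():
        failures = sum(1 for s in states if s == "FAILED")
        rate = failures / len(states)
        if rate > max_failure_rate:
            flagged[name] = rate
    return flagged

runs = {
    "bronze_ingest": ["COMPLETED"] * 19 + ["FAILED"],     # 5%  -> within budget
    "silver_orders": ["COMPLETED"] * 8 + ["FAILED"] * 2,  # 20% -> flagged
}
print(flaky_pipelines(runs))  # {'silver_orders': 0.2}
```

The same dictionary can be built from the query results above and wired to an alert destination.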

2. Spark UI — task-level diagnosis

Spark UI is available in every cluster and job run. The main tabs for diagnosis:

Identifying slow stages

Go to the Stages tab and sort by Duration. Stages that take much longer than others usually point to one of three things: data skew (one task much larger than others), expensive shuffle (high data transferred between executors), or missing filters (reading far more data than necessary).
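The "one task much larger than others" symptom reduces to a simple ratio over task durations. A minimal sketch (plain Python; in practice the durations come from the Stages tab or the status tracker, and the 3-5x cutoff is a rule of thumb, not a Spark constant):

```python
def skew_ratio(task_durations_sec):
    """Ratio of the slowest task to the mean task duration.

    Values well above ~3-5x usually indicate data skew in the stage.
    """
    mean = sum(task_durations_sec) / len(task_durations_sec)
    return max(task_durations_sec) / mean

# A healthy stage: tasks take roughly the same time
balanced = [10, 11, 9, 12, 10]
# A skewed stage: one task dominates (one oversized partition)
skewed = [10, 11, 9, 12, 300]

print(f"balanced: {skew_ratio(balanced):.1f}x")  # ~1.2x
print(f"skewed:   {skew_ratio(skewed):.1f}x")    # ~4.4x
```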

Data skew diagnosis

# Detect skew before writing — check task duration distribution
from pyspark.sql import functions as F

# describe only the per-key counts: if max >> mean, the key is skewed
df.groupBy("partition_column").count().describe("count").show()

# If max >> mean, there's skew — repartition with salt
salt = 50
df_salted = df.withColumn("salt", (F.rand() * salt).cast("int")) \
              .repartition("partition_column", "salt")

Reading Spark UI programmatically

# Inspect the same stage status the Spark UI shows, from code,
# via the PySpark status tracker (available on any cluster)
tracker = spark.sparkContext.statusTracker()

for stage_id in tracker.getActiveStageIds():
    info = tracker.getStageInfo(stage_id)
    if info:
        print(f"Stage {stage_id}: {info.numCompletedTasks}/{info.numTasks} tasks, "
              f"{info.numActiveTasks} active, {info.numFailedTasks} failed")

3. Databricks SQL — Query Profiler

For SQL workloads on Databricks SQL warehouses, the Query Profiler is the most detailed tool available.

Accessing query history

-- Slowest queries in the last 24 hours (via system tables)
SELECT
  statement_id,
  executed_by,
  statement_text,
  total_duration_ms / 1000 AS duration_sec,
  read_rows,
  produced_rows,
  read_bytes / 1e9 AS read_gb,
  warehouse_id
FROM system.query.history
WHERE
  start_time >= CURRENT_TIMESTAMP - INTERVAL 24 HOURS
  AND execution_status = 'FINISHED'
ORDER BY total_duration_ms DESC
LIMIT 25;
-- Queries with full table scan (no partition pruning)
SELECT
  statement_id,
  statement_text,
  read_bytes / 1e9 AS read_gb,
  total_duration_ms / 1000 AS duration_sec
FROM system.query.history
WHERE
  start_time >= CURRENT_TIMESTAMP - INTERVAL 7 DAYS
  AND read_bytes > 10e9  -- > 10 GB read
  AND produced_rows < 1000  -- but few rows returned
ORDER BY read_bytes DESC;

Reading the Query Profile

In the Databricks SQL UI: SQL Editor → Query History → [query] → Query Profile

Key metrics to monitor:

  • Rows filtered: ratio between rows read vs rows returned — if > 100x, check partition pruning
  • Spill to disk: if > 0, increase warehouse memory or optimize the query
  • Broadcast vs Sort-Merge Join: prefer broadcast for small tables (< 200 MB)
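The rows-read vs rows-returned check is easy to codify. A sketch of the heuristic (plain Python; the function name is illustrative, the inputs are the same read_rows/produced_rows fields the query history exposes):

```python
def needs_pruning_review(read_rows, produced_rows, threshold=100):
    """Flag queries that read far more rows than they return.

    A ratio above ~100x usually means partition pruning or data
    skipping is not kicking in for the query's filters.
    """
    ratio = read_rows / max(produced_rows, 1)  # avoid division by zero
    return ratio > threshold

print(needs_pruning_review(read_rows=50_000_000, produced_rows=120))  # True
print(needs_pruning_review(read_rows=10_000, produced_rows=9_500))    # False
```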

4. Lakehouse Monitoring — automated data quality

Lakehouse Monitoring creates monitoring tables for Delta tables, automatically tracking statistics and drift.

Creating a monitor

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import (
    MonitorInferenceLog,
    MonitorInferenceLogProblemType,
    MonitorTimeSeries,
)

w = WorkspaceClient()

# Monitor for an inference table (ML model)
w.quality_monitors.create(
    table_name="catalog_prod.gold.model_predictions",
    inference_log=MonitorInferenceLog(
        timestamp_col="prediction_timestamp",
        granularities=["1 day"],
        model_id_col="model_id",
        prediction_col="prediction",
        label_col="actual_label",
        problem_type=MonitorInferenceLogProblemType.PROBLEM_TYPE_CLASSIFICATION
    ),
    output_schema_name="catalog_prod.monitoring",
    assets_dir="/Shared/monitoring/model_predictions"
)
# Monitor for a time series table (pipeline data)
w.quality_monitors.create(
    table_name="catalog_prod.silver.orders",
    time_series=MonitorTimeSeries(
        timestamp_col="order_date",
        granularities=["1 day", "1 week"]
    ),
    output_schema_name="catalog_prod.monitoring",
    assets_dir="/Shared/monitoring/orders"
)

Querying monitoring results

-- Data drift in the last 14 days
SELECT
  window.start AS window_start,
  column_name,
  avg,
  num_nulls,
  percent_null,
  num_zeros
FROM catalog_prod.monitoring.silver_orders_profile_metrics
WHERE window.start >= CURRENT_TIMESTAMP - INTERVAL 14 DAYS
ORDER BY window_start DESC, column_name;
-- Detect anomalies (values outside expected range)
SELECT
  window.start AS window_start,
  column_name,
  avg,
  LAG(avg) OVER (PARTITION BY column_name ORDER BY window.start) AS prev_avg,
  ABS(avg - LAG(avg) OVER (PARTITION BY column_name ORDER BY window.start))
    / NULLIF(LAG(avg) OVER (PARTITION BY column_name ORDER BY window.start), 0) AS pct_change
FROM catalog_prod.monitoring.silver_orders_profile_metrics
QUALIFY pct_change > 0.20  -- > 20% variation between windows
ORDER BY pct_change DESC;
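The same window-over-window comparison can be sketched in plain Python, which is handy for unit-testing the 20% rule before wiring it to the metrics table (function name illustrative):

```python
def drift_alerts(window_avgs, max_pct_change=0.20):
    """Compare each window's average to the previous one and
    return the (index, pct_change) pairs that exceed the threshold."""
    alerts = []
    for i in range(1, len(window_avgs)):
        prev, curr = window_avgs[i - 1], window_avgs[i]
        if prev == 0:
            continue  # mirror NULLIF(..., 0) in the SQL version
        pct = abs(curr - prev) / prev
        if pct > max_pct_change:
            alerts.append((i, pct))
    return alerts

# Daily averages of an order-amount column: the last day jumps ~50%
print(drift_alerts([100.0, 104.0, 98.0, 150.0]))  # only window 3 is flagged
```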

5. Delta Table History — pipeline audit

from delta.tables import DeltaTable

dt = DeltaTable.forName(spark, "catalog_prod.silver.orders")

# Full history with operations and metadata
history = dt.history()
history.select(
    "version",
    "timestamp",
    "operation",
    "operationParameters",
    "operationMetrics"
).show(20, truncate=False)
# Tables that haven't been updated in 24h (SLA alert)
from datetime import datetime

SLA_TABLES = [
    "catalog_prod.silver.orders",
    "catalog_prod.silver.customers",
    "catalog_prod.gold.revenue_daily",
]

for table in SLA_TABLES:
    # DESCRIBE HISTORY returns commits newest-first
    last_write = spark.sql(f"DESCRIBE HISTORY {table}").limit(1).first()["timestamp"]
    hours = (datetime.now() - last_write).total_seconds() / 3600
    if hours > 24:
        print(f"SLA ALERT: {table} last written {hours:.1f}h ago")

6. Job Run History — SLA monitoring

# Check job runs via SDK
from databricks.sdk import WorkspaceClient
from datetime import datetime, timedelta

w = WorkspaceClient()

runs = w.jobs.list_runs(
    job_id=123456,
    start_time_from=int((datetime.now() - timedelta(days=7)).timestamp() * 1000),
    completed_only=True
)

for run in runs:
    duration_min = (run.end_time - run.start_time) / 60000
    print(f"Run {run.run_id}: {run.state.result_state} — {duration_min:.1f} min")
-- Job run history via system tables
SELECT
  job_id,
  run_id,
  result_state,
  DATEDIFF(SECOND, period_start_time, period_end_time) AS duration_sec,
  period_start_time
FROM system.lakeflow.job_run_timeline
WHERE period_start_time >= CURRENT_TIMESTAMP - INTERVAL 7 DAYS
ORDER BY period_start_time DESC;

7. Alerting — closing the loop

Observability without alerts is just a dashboard. Use Databricks SQL Alerts to trigger notifications:

-- Alert: job taking more than 2x the historic average
SELECT
  AVG(DATEDIFF(SECOND, period_start_time, period_end_time)) AS avg_duration_sec,
  STDDEV(DATEDIFF(SECOND, period_start_time, period_end_time)) AS stddev_sec
FROM system.lakeflow.job_run_timeline
WHERE
  job_id = 123456
  AND result_state = 'SUCCEEDED'
  AND period_start_time >= CURRENT_TIMESTAMP - INTERVAL 30 DAYS
Configure in Databricks SQL → Alerts → New Alert pointing to this query with threshold avg_duration_sec > 7200 (2 hours). Destination: email, Slack, PagerDuty via webhook.
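The avg/stddev pair the query returns plugs into a standard threshold rule. A sketch of the alert decision (plain Python; the 2-sigma choice is an assumption, tune it per job):

```python
def should_alert(duration_sec, avg_sec, stddev_sec, n_sigmas=2.0):
    """Alert when a run is more than n_sigmas above the historic mean.

    Close to the 'more than 2x the average' rule for noisy jobs,
    but adapts automatically to how variable the job actually is.
    """
    return duration_sec > avg_sec + n_sigmas * stddev_sec

# Job averages 1800 s with 300 s of spread -> alert threshold is 2400 s
print(should_alert(3600, avg_sec=1800, stddev_sec=300))  # True
print(should_alert(2100, avg_sec=1800, stddev_sec=300))  # False
```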


Observability dashboard — recommended structure

| Layer | Metric | Tool | Frequency |
|---|---|---|---|
| Cost | DBUs per job | system.billing | Daily |
| Reliability | Failed runs / total runs | system.lakeflow | Per run |
| Performance | Avg query duration | system.query.history | Hourly |
| Data quality | Null % per column | Lakehouse Monitoring | Daily |
| Security | Accesses by user | system.access.audit | Daily |
| Freshness | Hours since last write | Delta history | Hourly |


Practical rules

  1. Enable System Tables on day one — historical data is not backfilled
  2. Create a dedicated monitoring schema in Unity Catalog for all monitoring outputs
  3. Set cost alerts before the project grows — it's much harder to cut costs after
  4. Lakehouse Monitoring on Gold tables — these are what the business consumes
  5. Don't monitor everything — pick 5–10 metrics that actually matter for each pipeline