datarani

Why Databricks is central to the modern Lakehouse

Databricks combines Apache Spark with Delta Lake in a managed platform, removing most of the operational work of running your own Spark cluster. For data engineers building a Lakehouse on Azure, Databricks is where most Bronze → Silver → Gold transformations happen.

Cluster types

All-purpose cluster

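Used for interactive development in notebooks. Avoid it for production jobs: all-purpose compute is billed at a higher rate than job compute, and the cluster keeps running while idle until auto-termination kicks in or someone stops it.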

Use for exploration, development and debugging
Set auto-termination to 30-60 minutes
Share clusters across the team to reduce cost
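As a rough sketch, the auto-termination advice above can be baked into the spec you send when creating a development cluster. The `autotermination_minutes` field belongs to the Clusters API; the helper name, the cluster name and the worker range are hypothetical and only illustrate the idea.

```python
import json


def all_purpose_cluster_spec(name: str, idle_minutes: int = 30) -> dict:
    """Build a dev cluster spec that shuts down after `idle_minutes` of inactivity."""
    if not 30 <= idle_minutes <= 60:
        # Range taken from the recommendation above, not from an API limit.
        raise ValueError("keep auto-termination between 30 and 60 minutes")
    return {
        "cluster_name": name,  # hypothetical name
        "spark_version": "14.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "autoscale": {"min_workers": 1, "max_workers": 4},  # assumed range
        "autotermination_minutes": idle_minutes,
    }


spec = all_purpose_cluster_spec("team-shared-dev", idle_minutes=45)
print(json.dumps(spec, indent=2))
```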

Job cluster

Created specifically for a job run and destroyed when it finishes. This is the right default for production.

{
  "new_cluster": {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {
      "min_workers": 2,
      "max_workers": 8
    }
  }
}
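A cluster spec takes either a fixed `num_workers` or an `autoscale` range, not both. A small check before deploying can catch the mix-up early. This is only a sketch, and `check_worker_sizing` is a hypothetical helper, not part of any Databricks SDK.

```python
def check_worker_sizing(new_cluster: dict) -> str:
    """Return "fixed" or "autoscale", or raise if the spec is ambiguous."""
    has_fixed = "num_workers" in new_cluster
    has_auto = "autoscale" in new_cluster
    if has_fixed == has_auto:
        # Either both are present or neither is.
        raise ValueError("set exactly one of 'num_workers' or 'autoscale'")
    if has_auto:
        scale = new_cluster["autoscale"]
        if scale["min_workers"] > scale["max_workers"]:
            raise ValueError("min_workers must not exceed max_workers")
        return "autoscale"
    return "fixed"


mode = check_worker_sizing({
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},
})
print(mode)
```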

SQL warehouse

For analytical SQL queries and BI through Databricks SQL. A SQL warehouse is managed separately from all-purpose and job clusters, and it is usually the more efficient choice for SQL-only workloads.

Cluster sizing for production

Avoid the common mistake of overprovisioning. Start small and monitor:

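# Check executor resource usage in the job.
# PySpark's StatusTracker does not expose getExecutorInfos(), so go through the JVM tracker.
for info in spark.sparkContext._jsc.sc().statusTracker().getExecutorInfos():
    print(info.host(), info.numRunningTasks(), info.usedOnHeapStorageMemory())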

For typical transformation jobs:

Small (up to 50GB): Standard_DS3_v2 with 2-4 workers
Medium (50-500GB): Standard_DS4_v2 with 4-8 workers
Large (500GB+): consider Standard_DS5_v2 with autoscaling
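The tiers above can be turned into a starting-point lookup, for example to pick a default when generating job definitions. The function name is hypothetical, and treating exactly 50GB as "small" is an assumption, since the ranges above overlap at the boundary.

```python
from typing import Optional, Tuple


def suggest_sizing(data_gb: float) -> Tuple[str, Optional[Tuple[int, int]]]:
    """Return (node_type_id, (min_workers, max_workers)) for a transformation job."""
    if data_gb < 0:
        raise ValueError("data_gb must be non-negative")
    if data_gb <= 50:
        return "Standard_DS3_v2", (2, 4)
    if data_gb < 500:
        return "Standard_DS4_v2", (4, 8)
    # The list gives no worker range for the large tier; derive it from monitoring.
    return "Standard_DS5_v2", None


print(suggest_sizing(120))
```

Treat the result as a first guess only, and resize after watching a few real runs.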

Essential Spark config

spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "200")  # adjust by volume
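One way to "adjust by volume" is to derive the partition count from the amount of data being shuffled. The sketch below assumes a target of roughly 128 MB per shuffle partition, which is a common rule of thumb rather than a Databricks requirement. The function name and the floor of 8 partitions are also assumptions.

```python
import math


def shuffle_partitions(shuffle_bytes: int,
                       target_partition_mb: int = 128,
                       min_partitions: int = 8) -> int:
    """Estimate spark.sql.shuffle.partitions from the shuffled data volume."""
    target_bytes = target_partition_mb * 1024 * 1024
    return max(min_partitions, math.ceil(shuffle_bytes / target_bytes))


n = shuffle_partitions(50 * 1024**3)  # about 50 GB of shuffled data
print(n)

# In a notebook, apply it before the heavy stage runs:
# spark.conf.set("spark.sql.shuffle.partitions", str(n))
```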

Notebooks in production

Recommended structure

notebooks/
  ingestion/
    bronze_orders.py
    bronze_customers.py
  transformation/
    silver_orders.py
    gold_revenue.py
  utils/
    common_functions.py
    logging.py

%run vs dbutils.notebook.run

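Use %run to execute a utility notebook inline, in the same execution context, so its functions and variables become available to the calling notebook. No separate run is started, which makes it the cheaper option: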

%run ../utils/common_functions

Use dbutils.notebook.run to execute a notebook as a separate run with its own parameters. It returns the string the child notebook passes to dbutils.notebook.exit:

result = dbutils.notebook.run(
  "./silver_orders",
  timeout_seconds=3600,
  arguments={"date": "2026-04-01", "env": "prod"}
)
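Because `dbutils.notebook.run` raises when the child notebook fails or times out, a thin retry wrapper is a common companion. The sketch below takes the runner as a parameter so it can be exercised outside Databricks. `run_with_retry` and its arguments are hypothetical names.

```python
import time
from typing import Callable, Dict


def run_with_retry(run_notebook: Callable[[str, int, Dict[str, str]], str],
                   path: str,
                   timeout_seconds: int,
                   arguments: Dict[str, str],
                   max_retries: int = 2,
                   wait_seconds: float = 0.0) -> str:
    """Call `run_notebook` and retry up to `max_retries` times on any exception."""
    attempt = 0
    while True:
        try:
            return run_notebook(path, timeout_seconds, arguments)
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted, surface the last error
            attempt += 1
            time.sleep(wait_seconds)


# In a notebook, pass the real runner:
# result = run_with_retry(dbutils.notebook.run, "./silver_orders", 3600,
#                         {"date": "2026-04-01", "env": "prod"})
```

For scheduled pipelines, task-level retries in the job definition are usually the simpler choice. A wrapper like this is mostly useful when one notebook orchestrates others.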

Widgets for parameterization

dbutils.widgets.text("execution_date", "", "Execution date")
dbutils.widgets.dropdown("env", "dev", ["dev", "staging", "prod"])

execution_date = dbutils.widgets.get("execution_date")
env = dbutils.widgets.get("env")
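Widget values always arrive as strings, so it helps to validate them before the notebook does any work. A minimal sketch follows. Falling back to today's date when `execution_date` is empty is an assumption about the desired behaviour, and `parse_params` is a hypothetical helper.

```python
from datetime import date, datetime
from typing import Optional, Tuple

ALLOWED_ENVS = ("dev", "staging", "prod")


def parse_params(raw_date: str, env: str,
                 today: Optional[date] = None) -> Tuple[date, str]:
    """Validate widget strings and return (execution_date, env)."""
    if env not in ALLOWED_ENVS:
        raise ValueError(f"env must be one of {ALLOWED_ENVS}, got {env!r}")
    if raw_date.strip() == "":
        # Assumed default: an empty widget means "run for today".
        return (today or date.today()), env
    # Raises ValueError if the value is not in YYYY-MM-DD format.
    return datetime.strptime(raw_date.strip(), "%Y-%m-%d").date(), env


print(parse_params("2026-04-01", "prod"))

# In a notebook:
# execution_date, env = parse_params(dbutils.widgets.get("execution_date"),
#                                    dbutils.widgets.get("env"))
```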

Databricks Workflows (Jobs)

Creating a multi-task job

{
  "name": "pipeline_silver_daily",
  "tasks": [
    {
      "task_key": "ingest_bronze",
      "notebook_task": { "notebook_path": "/ingestion/bronze_orders" },
      "new_cluster": { ... }
    },
    {
      "task_key": "transform_silver",
      "depends_on": [{ "task_key": "ingest_bronze" }],
      "notebook_task": { "notebook_path": "/transformation/silver_orders" },
      "new_cluster": { ... }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "America/Sao_Paulo"
  }
}
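Before submitting a job definition like this, it can be worth checking that every `depends_on` entry points at a real task and that the graph has no cycles. The sketch below uses the standard-library `graphlib` module (Python 3.9+). `task_order` is a hypothetical helper, and the cluster settings are left out of the example because they do not affect ordering.

```python
from graphlib import TopologicalSorter


def task_order(job: dict) -> list:
    """Return task keys in a valid execution order, or raise on a bad graph."""
    graph = {
        task["task_key"]: {dep["task_key"] for dep in task.get("depends_on", [])}
        for task in job["tasks"]
    }
    unknown = {dep for deps in graph.values() for dep in deps} - set(graph)
    if unknown:
        raise ValueError(f"depends_on references unknown tasks: {sorted(unknown)}")
    # static_order() raises graphlib.CycleError if the tasks form a cycle.
    return list(TopologicalSorter(graph).static_order())


job = {
    "name": "pipeline_silver_daily",
    "tasks": [
        {"task_key": "ingest_bronze"},
        {"task_key": "transform_silver",
         "depends_on": [{"task_key": "ingest_bronze"}]},
    ],
}
print(task_order(job))
```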

Retry and alerts

{
  "max_retries": 2,
  "min_retry_interval_millis": 300000,
  "email_notifications": {
    "on_failure": ["data-team@company.com"],
    "on_success": []
  }
}

Secrets with Databricks Secret Scope

Never store credentials in notebooks. Use Secret Scopes linked to Azure Key Vault:

storage_account_key = dbutils.secrets.get(
    scope="kv-lakehouse-prod",
    key="adls-storage-account-key"
)

spark.conf.set(
    "fs.azure.account.key.datalakeprod.dfs.core.windows.net",
    storage_account_key
)
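If several storage accounts are involved, building the configuration key in one place avoids typos in the long property name. The helper below is a hypothetical sketch. The only rule it enforces is the Azure storage account naming format (3 to 24 lowercase letters and digits).

```python
import re


def adls_account_key_conf(storage_account: str) -> str:
    """Return the Spark config key that holds the account key for ADLS Gen2."""
    if not re.fullmatch(r"[a-z0-9]{3,24}", storage_account):
        raise ValueError(f"invalid storage account name: {storage_account!r}")
    return f"fs.azure.account.key.{storage_account}.dfs.core.windows.net"


conf_key = adls_account_key_conf("datalakeprod")
print(conf_key)

# In a notebook:
# spark.conf.set(conf_key, dbutils.secrets.get(scope="kv-lakehouse-prod",
#                                              key="adls-storage-account-key"))
```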

Cost optimization

Spot instances for interruption-tolerant jobs (typically 60-80% savings on VM cost; the actual discount varies by region and VM size)
Auto-terminate is mandatory on all-purpose clusters
Photon only where there's measurable gain on analytical SQL queries
Job clusters always in production, never all-purpose
Monitor with Cost Analysis in Azure and the Databricks Account Console
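On Azure, spot capacity is requested through the `azure_attributes` block of the cluster spec. The sketch below adds it to an existing job cluster definition. The helper name is hypothetical, and keeping one on-demand node (the driver) is an assumed default that you may want to change.

```python
import copy


def with_spot_workers(new_cluster: dict, first_on_demand: int = 1) -> dict:
    """Return a copy of the cluster spec that prefers spot VMs for workers."""
    spec = copy.deepcopy(new_cluster)
    spec["azure_attributes"] = {
        # The first N nodes (driver first) stay on on-demand VMs.
        "first_on_demand": first_on_demand,
        # Fall back to on-demand VMs when spot capacity is unavailable.
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        # -1 means: never evict on price, pay at most the on-demand price.
        "spot_bid_max_price": -1,
    }
    return spec


base = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},
}
spot_cluster = with_spot_workers(base)
print(spot_cluster["azure_attributes"])
```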

Unity Catalog integration

# Always use three-part naming with Unity Catalog
df = spark.read.table("catalog_prod.silver.orders")

df.write \
  .format("delta") \
  .mode("overwrite") \
  .saveAsTable("catalog_prod.gold.revenue_daily")

Avoid absolute ADLS paths when Unity Catalog is configured. Always use the catalog namespace.
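To keep the catalog switchable per environment, the three-part name can be assembled from parameters instead of being hard-coded. This is a minimal sketch. `table_name` is a hypothetical helper, and it deliberately accepts only simple identifiers (letters, digits, underscores), which is stricter than what Unity Catalog itself allows.

```python
import re

_SIMPLE_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def table_name(catalog: str, schema: str, table: str) -> str:
    """Build a catalog.schema.table name from validated parts."""
    parts = (catalog, schema, table)
    for part in parts:
        if not _SIMPLE_IDENTIFIER.match(part):
            raise ValueError(f"unsupported identifier: {part!r}")
    return ".".join(parts)


source = table_name("catalog_prod", "silver", "orders")
print(source)

# In a notebook:
# df = spark.read.table(source)
```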

Databricks for Data Engineers: clusters, jobs and notebooks in production