Databricks · 2026-03-28 · 3 min read

Databricks for Data Engineers: clusters, jobs and notebooks in production

Everything you need to operate Databricks in production: cluster types, orchestration with Workflows and cost best practices.

Why Databricks is central to the modern Lakehouse

Databricks combines Apache Spark with Delta Lake in a managed platform, eliminating the operational complexity of running your own Spark cluster. For data engineers working with Azure Lakehouses, Databricks is where most Bronze → Silver → Gold transformations happen.

Cluster types

All-purpose cluster

Used for interactive development in notebooks. Never use in production — it's expensive and stays on even when idle.

  • Use for exploration, development and debugging
  • Set auto-termination to 30-60 minutes
  • Share clusters across the team to reduce cost
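Auto-termination is set in the cluster spec itself. A minimal fragment using the Clusters API field names (the cluster name, node type and timeout below are illustrative):

```json
{
  "cluster_name": "dev-shared",
  "spark_version": "14.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2,
  "autotermination_minutes": 45
}
```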

Job cluster

Created specifically for a job run and terminated when it finishes. This is the standard for production.

{
  "new_cluster": {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {
      "min_workers": 2,
      "max_workers": 8
    }
  }
}

Note: num_workers and autoscale are mutually exclusive in the cluster spec — set a fixed worker count or an autoscale range, not both.

SQL warehouse

For analytical SQL queries and BI via Databricks SQL. Decoupled from Spark clusters — more efficient for purely SQL workloads.

Cluster sizing for production

Avoid the common mistake of overprovisioning. Start small and monitor:

# Check executor resource usage. getExecutorInfos() is part of the Scala
# StatusTracker API; from PySpark, reach it through the JVM bridge
# (or simply use the "Executors" tab in the Spark UI)
spark.sparkContext._jsc.sc().statusTracker().getExecutorInfos()

For typical transformation jobs:

  • Small (up to 50GB): Standard_DS3_v2 with 2-4 workers
  • Medium (50-500GB): Standard_DS4_v2 with 4-8 workers
  • Large (500GB+): consider Standard_DS5_v2 with autoscaling
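The tiers above can be captured in a small lookup helper. A hedged sketch — the thresholds and node types are the illustrative ones from this list, not official Databricks recommendations:

```python
# Illustrative sizing helper based on the tiers above; thresholds and
# node types are examples, not official Databricks guidance.
def suggest_cluster(data_gb: float) -> dict:
    if data_gb <= 50:
        return {"node_type_id": "Standard_DS3_v2", "min_workers": 2, "max_workers": 4}
    if data_gb <= 500:
        return {"node_type_id": "Standard_DS4_v2", "min_workers": 4, "max_workers": 8}
    # 500GB+: larger nodes plus autoscaling headroom
    return {"node_type_id": "Standard_DS5_v2", "min_workers": 4, "max_workers": 16}

print(suggest_cluster(120)["node_type_id"])  # Standard_DS4_v2
```

Keep a helper like this in version control next to your job definitions so sizing decisions stay reviewable instead of ad hoc.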

Essential Spark config

spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "200")  # adjust by volume
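spark.sql.shuffle.partitions is the setting most worth tuning per job. A common heuristic (a rule of thumb, not an official recommendation) is to target roughly 128-200 MB per shuffle partition:

```python
# Heuristic: size shuffle partitions to ~128-200 MB each (a common rule
# of thumb, not an official Databricks recommendation).
def shuffle_partitions(shuffle_size_gb: float, target_mb: int = 150) -> int:
    partitions = int(shuffle_size_gb * 1024 / target_mb)
    return max(partitions, 1)

# e.g. for a ~300 GB shuffle:
n = shuffle_partitions(300)
# spark.conf.set("spark.sql.shuffle.partitions", str(n))
```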

Notebooks in production

Recommended structure

notebooks/
  ingestion/
    bronze_orders.py
    bronze_customers.py
  transformation/
    silver_orders.py
    gold_revenue.py
  utils/
    common_functions.py
    logging.py
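common_functions.py typically holds small shared helpers. One hedged example — a builder for layer-qualified table names, so every notebook names tables the same way (the catalog name is illustrative):

```python
# Example shared helper for utils/common_functions.py: build consistent
# table identifiers per medallion layer (catalog name is illustrative).
VALID_LAYERS = {"bronze", "silver", "gold"}

def table_name(layer: str, entity: str, catalog: str = "catalog_prod") -> str:
    if layer not in VALID_LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{catalog}.{layer}.{entity}"

print(table_name("silver", "orders"))  # catalog_prod.silver.orders
```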

%run vs dbutils.notebook.run

Use %run to import utilities into the current notebook's session — functions and variables become available directly, with no extra execution overhead:

%run ../utils/common_functions

Use dbutils.notebook.run to execute notebooks as subprocesses with parameters:

result = dbutils.notebook.run(
  "./silver_orders",
  timeout_seconds=3600,
  arguments={"date": "2026-04-01", "env": "prod"}
)
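dbutils.notebook.run returns whatever string the child notebook passes to dbutils.notebook.exit. A common pattern (an assumption about how you structure your child notebooks, not an API requirement) is to exit with a JSON payload and parse it in the parent:

```python
import json

# The child notebook would end with:
#   dbutils.notebook.exit(json.dumps({"status": "ok", "rows": 1234}))
# The parent then parses the returned string:
def parse_result(raw: str) -> dict:
    result = json.loads(raw)
    if result.get("status") != "ok":
        raise RuntimeError(f"child notebook failed: {result}")
    return result

print(parse_result('{"status": "ok", "rows": 1234}')["rows"])  # 1234
```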

Widgets for parameterization

dbutils.widgets.text("execution_date", "", "Execution date")
dbutils.widgets.dropdown("env", "dev", ["dev", "staging", "prod"])

execution_date = dbutils.widgets.get("execution_date")
env = dbutils.widgets.get("env")
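Widget values always arrive as strings, so it pays to validate them as soon as they are read. A minimal sketch — the validation rules and the default-to-today behavior are illustrative choices:

```python
from datetime import date, datetime

# Validate widget inputs early; the rules here are illustrative.
def parse_params(execution_date: str, env: str):
    if env not in {"dev", "staging", "prod"}:
        raise ValueError(f"invalid env: {env}")
    # empty date widget -> default to today
    if execution_date:
        day = datetime.strptime(execution_date, "%Y-%m-%d").date()
    else:
        day = date.today()
    return day, env

day, env = parse_params("2026-04-01", "prod")
print(day.isoformat(), env)  # 2026-04-01 prod
```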

Databricks Workflows (Jobs)

Creating a multi-task job

{
  "name": "pipeline_silver_daily",
  "tasks": [
    {
      "task_key": "ingest_bronze",
      "notebook_task": { "notebook_path": "/ingestion/bronze_orders" },
      "new_cluster": { ... }
    },
    {
      "task_key": "transform_silver",
      "depends_on": [{ "task_key": "ingest_bronze" }],
      "notebook_task": { "notebook_path": "/transformation/silver_orders" },
      "new_cluster": { ... }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "America/Sao_Paulo"
  }
}
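A multi-task spec like the one above can be generated rather than hand-written, which helps when pipelines grow. A hedged sketch that chains notebook tasks linearly in the Jobs API shape shown above (cluster details elided, paths illustrative):

```python
# Build a linear chain of notebook tasks in the Jobs API shape shown
# above. Cluster config and notebook paths are illustrative.
def linear_job(name: str, notebook_paths: list, cluster: dict) -> dict:
    tasks = []
    prev_key = None
    for path in notebook_paths:
        key = path.rsplit("/", 1)[-1]
        task = {
            "task_key": key,
            "notebook_task": {"notebook_path": path},
            "new_cluster": cluster,
        }
        if prev_key:  # each task depends on the previous one
            task["depends_on"] = [{"task_key": prev_key}]
        tasks.append(task)
        prev_key = key
    return {"name": name, "tasks": tasks}

job = linear_job("pipeline_silver_daily",
                 ["/ingestion/bronze_orders", "/transformation/silver_orders"],
                 {"spark_version": "14.3.x-scala2.12"})
```

Generated specs can then be committed as JSON and deployed via the Jobs API or an IaC tool, keeping orchestration changes reviewable.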

Retry and alerts

{
  "max_retries": 2,
  "min_retry_interval_millis": 300000,
  "email_notifications": {
    "on_failure": ["data-team@company.com"],
    "on_success": []
  }
}

Secrets with Databricks Secret Scope

Never store credentials in notebooks. Use Secret Scopes linked to Azure Key Vault:

storage_account_key = dbutils.secrets.get(
    scope="kv-lakehouse-prod",
    key="adls-storage-account-key"
)

spark.conf.set(
    "fs.azure.account.key.datalakeprod.dfs.core.windows.net",
    storage_account_key
)
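The fs.azure.account.key.* config key embeds the storage account name, so a small helper keeps it consistent across environments (the account name below is illustrative):

```python
# Build the per-account Spark config key for ADLS Gen2 access-key auth.
# The storage account name is illustrative.
def adls_key_conf(storage_account: str) -> str:
    return f"fs.azure.account.key.{storage_account}.dfs.core.windows.net"

print(adls_key_conf("datalakeprod"))
# fs.azure.account.key.datalakeprod.dfs.core.windows.net
```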

Cost optimization

  1. Spot instances for interruption-tolerant jobs (60-80% savings)
  2. Auto-terminate mandatory on all-purpose clusters
  3. Photon only where there's measurable gain on analytical SQL queries
  4. Job clusters always in production, never all-purpose
  5. Monitor with Cost Analysis in Azure + Databricks Account Console
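To make the spot-instance saving concrete, a back-of-envelope estimator helps when comparing cluster options. All rates below are hypothetical — real VM and DBU prices vary by region, tier and node type:

```python
# Back-of-envelope job cost: (VM + DBU) per node-hour, times hours and
# node count. All rates here are hypothetical; check your actual
# Azure and Databricks pricing.
def job_cost(hours: float, workers: int, vm_rate: float = 0.50,
             dbu_rate: float = 0.15, dbus_per_node_hour: float = 0.75,
             spot_discount: float = 0.0) -> float:
    vm = vm_rate * (1 - spot_discount)          # spot only discounts the VM
    per_node_hour = vm + dbu_rate * dbus_per_node_hour
    return round(per_node_hour * hours * (workers + 1), 2)  # +1 for driver

on_demand = job_cost(2, 4)
with_spot = job_cost(2, 4, spot_discount=0.7)
```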

Unity Catalog integration

# Always use three-part naming with Unity Catalog
df = spark.read.table("catalog_prod.silver.orders")

df.write \
  .format("delta") \
  .mode("overwrite") \
  .saveAsTable("catalog_prod.gold.revenue_daily")

Avoid absolute ADLS paths when Unity Catalog is configured — always use the catalog namespace.