Databricks · 2026-03-28 · 3 min read

Databricks for Data Engineers: clusters, jobs and notebooks in production

Everything you need to operate Databricks in production: cluster types, orchestration with Workflows and cost best practices.

Why Databricks is central to the modern Lakehouse

Databricks combines Apache Spark with Delta Lake in a managed platform, eliminating the operational complexity of running your own Spark cluster. For data engineers working with Azure Lakehouses, Databricks is where most Bronze → Silver → Gold transformations happen.

Cluster types

All-purpose cluster

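Used for interactive development in notebooks. Avoid it for scheduled production work: all-purpose compute is billed at a higher rate than jobs compute, and the cluster keeps running while idle until it auto-terminates or someone stops it.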

  • Use for exploration, development and debugging
  • Set auto-termination to 30-60 minutes
  • Share clusters across the team to reduce cost

Job cluster

Created specifically for a job run and destroyed when it finishes. This is the right default for production.

{
  "new_cluster": {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {
      "min_workers": 2,
      "max_workers": 8
    }
  }
}
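If you assemble this payload in code before submitting it, a small builder helps avoid malformed specs. The sketch below is a minimal example, not an official client: the helper name job_cluster_spec is hypothetical, and the field names are the ones from the JSON above. The Clusters API treats num_workers (fixed size) and autoscale as alternatives, so the helper accepts exactly one of them.

```python
import json


def job_cluster_spec(spark_version, node_type_id, num_workers=None, autoscale=None):
    """Build a new_cluster dict with either a fixed size or an autoscale range.

    Hypothetical helper; field names mirror the job cluster JSON in the article.
    autoscale is a (min_workers, max_workers) tuple.
    """
    # Exactly one sizing mode must be chosen.
    if (num_workers is None) == (autoscale is None):
        raise ValueError("set exactly one of num_workers or autoscale")

    spec = {"spark_version": spark_version, "node_type_id": node_type_id}
    if autoscale is not None:
        min_workers, max_workers = autoscale
        if not 0 < min_workers <= max_workers:
            raise ValueError("autoscale needs 0 < min_workers <= max_workers")
        spec["autoscale"] = {"min_workers": min_workers, "max_workers": max_workers}
    else:
        spec["num_workers"] = num_workers
    return {"new_cluster": spec}


spec = job_cluster_spec("14.3.x-scala2.12", "Standard_DS3_v2", autoscale=(2, 8))
print(json.dumps(spec, indent=2))
```

Passing both num_workers and autoscale raises a ValueError before anything reaches the API.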

SQL warehouse

For analytical SQL queries and BI via Databricks SQL. Decoupled from Spark clusters, it's more efficient for purely SQL workloads.

Cluster sizing for production

Avoid the common mistake of overprovisioning. Start small and monitor:

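# Check executor load from inside the job. PySpark's StatusTracker lacks
# getExecutorInfos(), so this calls the JVM tracker (assumes sparkContext is exposed).
for info in spark.sparkContext._jsc.sc().statusTracker().getExecutorInfos():
    print(info.host(), info.numRunningTasks())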

For typical transformation jobs:

  • Small (up to 50 GB): Standard_DS3_v2 with 2-4 workers
  • Medium (50-500 GB): Standard_DS4_v2 with 4-8 workers
  • Large (500 GB+): consider Standard_DS5_v2 with autoscaling
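The tiers above can be encoded as a lookup so that job definitions pick a starting size consistently. This is only a sketch of the list as written: the function name suggest_cluster is hypothetical, and treating exactly 50 and exactly 500 as the upper edge of the lower tier is an assumption, since the list does not say which side the boundaries fall on.

```python
def suggest_cluster(data_gb):
    """Map input volume in GB to the starting point from the sizing list.

    Returns (node_type_id, worker_range). worker_range is None for the large
    tier because the list only recommends autoscaling there, without numbers.
    Assumption: boundary values (50, 500) belong to the smaller tier.
    """
    if data_gb < 0:
        raise ValueError("data_gb must be non-negative")
    if data_gb <= 50:
        return "Standard_DS3_v2", (2, 4)
    if data_gb <= 500:
        return "Standard_DS4_v2", (4, 8)
    return "Standard_DS5_v2", None


for volume in (10, 120, 900):
    print(volume, suggest_cluster(volume))
```

Treat the result as a first guess to validate against real job metrics, not a final size.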

Essential Spark config

spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "200")  # 200 is Spark's default; tune to shuffle volume
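One way to adjust the shuffle partition count by volume is to aim for a fixed partition size. The sketch below assumes a target of roughly 128 MB per shuffle partition, which is a common rule of thumb rather than anything prescribed here; the function name and the target_mb parameter are hypothetical.

```python
import math


def suggest_shuffle_partitions(shuffle_gb, target_mb=128):
    """Estimate a value for spark.sql.shuffle.partitions.

    Assumption: aim for about target_mb of shuffled data per partition.
    Always returns at least 1.
    """
    if shuffle_gb < 0:
        raise ValueError("shuffle_gb must be non-negative")
    return max(1, math.ceil(shuffle_gb * 1024 / target_mb))


for gb in (10, 100):
    print(gb, suggest_shuffle_partitions(gb))
```

Inside a notebook the result would be passed as a string, for example spark.conf.set("spark.sql.shuffle.partitions", str(suggest_shuffle_partitions(100))).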

Notebooks in production

Recommended structure

notebooks/
  ingestion/
    bronze_orders.py
    bronze_customers.py
  transformation/
    silver_orders.py
    gold_revenue.py
  utils/
    common_functions.py
    logging.py

%run vs dbutils.notebook.run

Use %run to inline another notebook into the current session, so its functions and variables become available directly and no separate run is started:

%run ../utils/common_functions

Use dbutils.notebook.run to execute a notebook as a separate run with its own parameters. It returns the string the child notebook passes to dbutils.notebook.exit:

result = dbutils.notebook.run(
  "./silver_orders",
  timeout_seconds=3600,
  arguments={"date": "2026-04-01", "env": "prod"}
)
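Because dbutils.notebook.run raises when the child notebook fails or times out, a thin retry wrapper is a common companion. The sketch below is one possible shape, not a built-in feature: run_with_retry is a hypothetical helper, and the run function is injected so the example also runs outside Databricks, where fake_run stands in for dbutils.notebook.run.

```python
import time


def run_with_retry(run_fn, path, timeout_seconds, arguments, max_retries=2, wait_seconds=0):
    """Call run_fn(path, timeout_seconds, arguments), retrying on any exception.

    In a notebook you would pass dbutils.notebook.run as run_fn.
    The last exception is re-raised once max_retries is exhausted.
    """
    attempt = 0
    while True:
        try:
            return run_fn(path, timeout_seconds, arguments)
        except Exception:
            if attempt >= max_retries:
                raise
            attempt += 1
            time.sleep(wait_seconds)


# Stand-in for dbutils.notebook.run: fails on the first call, then succeeds.
calls = []


def fake_run(path, timeout_seconds, arguments):
    calls.append(path)
    if len(calls) == 1:
        raise RuntimeError("transient failure")
    return "OK"


result = run_with_retry(fake_run, "./silver_orders", 3600, {"date": "2026-04-01", "env": "prod"})
print(result, len(calls))
```

For notebooks scheduled as Workflows tasks, the task-level retry settings usually make this wrapper unnecessary; it is mainly useful when one notebook drives others.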

Widgets for parameterization

dbutils.widgets.text("execution_date", "", "Execution date")
dbutils.widgets.dropdown("env", "dev", ["dev", "staging", "prod"])

execution_date = dbutils.widgets.get("execution_date")
env = dbutils.widgets.get("env")
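Widget values always arrive as strings, so it pays to validate them before they reach any query. The sketch below assumes the YYYY-MM-DD format used elsewhere in this article and assumes that an empty execution_date should fall back to the current day; both helper names are hypothetical.

```python
from datetime import date, datetime


def resolve_execution_date(raw, today=None):
    """Parse a widget string into a date; empty input falls back to today.

    Assumption: dates are passed as YYYY-MM-DD. Invalid input raises ValueError.
    """
    raw = raw.strip()
    if not raw:
        return today or date.today()
    return datetime.strptime(raw, "%Y-%m-%d").date()


def resolve_env(raw, allowed=("dev", "staging", "prod")):
    """Reject environment names outside the dropdown's choices."""
    if raw not in allowed:
        raise ValueError(f"unknown env {raw!r}, expected one of {allowed}")
    return raw


print(resolve_execution_date("2026-04-01"), resolve_env("prod"))
```

In the notebook these would wrap the widget reads, for example resolve_execution_date(dbutils.widgets.get("execution_date")).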

Databricks Workflows (Jobs)

Creating a multi-task job

{
  "name": "pipeline_silver_daily",
  "tasks": [
    {
      "task_key": "ingest_bronze",
      "notebook_task": { "notebook_path": "/ingestion/bronze_orders" },
      "new_cluster": { ... }
    },
    {
      "task_key": "transform_silver",
      "depends_on": [{ "task_key": "ingest_bronze" }],
      "notebook_task": { "notebook_path": "/transformation/silver_orders" },
      "new_cluster": { ... }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "America/Sao_Paulo"
  }
}
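Before submitting a multi-task definition, it can help to check that every depends_on entry points at a real task and that the graph has no cycles. The sketch below does this with the standard library; task_order is a hypothetical helper, and the example job keeps only the keys needed for the check (the cluster settings are left out, as they are elided above).

```python
from graphlib import TopologicalSorter


def task_order(job):
    """Return task_keys in an order that respects depends_on.

    Raises ValueError for a dependency on an unknown task; a cycle raises
    graphlib.CycleError, which is a ValueError subclass.
    """
    keys = {task["task_key"] for task in job["tasks"]}
    graph = {}
    for task in job["tasks"]:
        deps = [dep["task_key"] for dep in task.get("depends_on", [])]
        unknown = [dep for dep in deps if dep not in keys]
        if unknown:
            raise ValueError(f"{task['task_key']} depends on unknown task(s): {unknown}")
        graph[task["task_key"]] = deps  # node -> its predecessors
    return list(TopologicalSorter(graph).static_order())


job = {
    "name": "pipeline_silver_daily",
    "tasks": [
        {"task_key": "ingest_bronze"},
        {"task_key": "transform_silver", "depends_on": [{"task_key": "ingest_bronze"}]},
    ],
}
print(task_order(job))
```

Running a check like this in CI catches a mistyped task_key before the job is deployed.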

Retry and alerts

{
  "max_retries": 2,
  "min_retry_interval_millis": 300000,
  "email_notifications": {
    "on_failure": ["data-team@company.com"],
    "on_success": []
  }
}
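The retry values are easier to reason about in minutes. As a quick sanity check, the sketch below computes the minimum waiting time that retries add when every attempt fails; retry_wait_minutes is a hypothetical helper and ignores the run time of each attempt.

```python
def retry_wait_minutes(max_retries, min_retry_interval_millis):
    """Minimum total wait added by retries, in minutes, if every attempt fails."""
    return max_retries * min_retry_interval_millis / 60_000


settings = {"max_retries": 2, "min_retry_interval_millis": 300000}
print(retry_wait_minutes(**settings))  # 10.0
```

With the settings above, a task that keeps failing runs three times in total and spends at least 10 minutes waiting between attempts, which matters when downstream consumers expect data by a fixed time.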

Secrets with Databricks Secret Scope

Never store credentials in notebooks. Use Secret Scopes linked to Azure Key Vault:

storage_account_key = dbutils.secrets.get(
    scope="kv-lakehouse-prod",
    key="adls-storage-account-key"
)

spark.conf.set(
    "fs.azure.account.key.datalakeprod.dfs.core.windows.net",
    storage_account_key
)
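The configuration key embeds the storage account name, so building it from a variable avoids copy-paste mistakes between environments. A minimal sketch, assuming the account name is supplied per environment; the helper name is hypothetical, and the validation follows Azure's naming rule for storage accounts (3 to 24 lowercase letters or digits).

```python
import re


def adls_account_key_conf(storage_account):
    """Return the Spark conf name that holds the key for an ADLS Gen2 account."""
    if not re.fullmatch(r"[a-z0-9]{3,24}", storage_account):
        raise ValueError("storage account names are 3-24 lowercase letters or digits")
    return f"fs.azure.account.key.{storage_account}.dfs.core.windows.net"


print(adls_account_key_conf("datalakeprod"))
```

In the notebook, the returned name would be the first argument to spark.conf.set, with the value coming from dbutils.secrets.get.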

Cost optimization

  1. Spot instances for interruption-tolerant jobs (savings of 60-80% on VM cost are commonly cited; actual discounts vary by region and VM size)
  2. Auto-terminate is mandatory on all-purpose clusters
  3. Photon only where you measure a gain that offsets its higher DBU rate, typically on analytical SQL queries
  4. Job clusters always in production, never all-purpose
  5. Monitor with Cost analysis in Azure Cost Management and the Databricks account console
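To see how a spot discount translates into the total bill, a rough estimate is enough. The sketch below rests on two assumptions: the spot discount reduces only the VM charge while DBUs are billed the same, and driver cost is ignored. Every price in the example is hypothetical and chosen only to make the arithmetic visible.

```python
def estimate_job_cost(hours, workers, vm_rate, dbu_rate, dbus_per_node_hour, spot_discount=0.0):
    """Rough cost of one job run: VM charge plus DBU charge.

    Assumption: spot_discount applies to the VM charge only; DBUs are
    unchanged. Driver cost is left out to keep the sketch short.
    """
    if not 0.0 <= spot_discount < 1.0:
        raise ValueError("spot_discount must be in [0, 1)")
    vm_cost = hours * workers * vm_rate * (1.0 - spot_discount)
    dbu_cost = hours * workers * dbus_per_node_hour * dbu_rate
    return vm_cost + dbu_cost


# Hypothetical prices, for illustration only.
rates = dict(hours=2, workers=4, vm_rate=0.30, dbu_rate=0.15, dbus_per_node_hour=0.75)
on_demand = estimate_job_cost(**rates)
spot = estimate_job_cost(**rates, spot_discount=0.70)
print(f"on-demand: {on_demand:.2f}, spot: {spot:.2f}, saved: {1 - spot / on_demand:.0%}")
```

Under these assumptions the saving on the total is smaller than the VM discount itself, because the DBU portion does not shrink.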

Unity Catalog integration

# Always use three-part naming with Unity Catalog
df = spark.read.table("catalog_prod.silver.orders")

df.write \
  .format("delta") \
  .mode("overwrite") \
  .saveAsTable("catalog_prod.gold.revenue_daily")

Avoid absolute ADLS paths when Unity Catalog is configured. Reference tables through the catalog namespace so that access is governed by Unity Catalog permissions.
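To keep three-part names consistent across environments, the catalog can be derived from the env parameter rather than hard-coded. The sketch below assumes a catalog-per-environment convention with a catalog_ prefix, inferred from the catalog_prod name used above; the helper name and the prefix parameter are hypothetical.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def table_name(env, schema, table, catalog_prefix="catalog_"):
    """Build a catalog.schema.table name for the given environment.

    Assumption: one catalog per environment, named <catalog_prefix><env>.
    Parts are restricted to plain identifiers so no quoting is needed.
    """
    catalog = f"{catalog_prefix}{env}"
    for part in (catalog, schema, table):
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"invalid identifier: {part!r}")
    return f"{catalog}.{schema}.{table}"


print(table_name("prod", "silver", "orders"))
print(table_name("dev", "gold", "revenue_daily"))
```

The result can be passed straight to spark.read.table or saveAsTable, so the same notebook reads and writes the right catalog in dev, staging and prod.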