Why Databricks is central to the modern Lakehouse
Databricks combines Apache Spark with Delta Lake in a managed platform, removing most of the operational work of running your own Spark clusters. For data engineers working with Azure Lakehouses, Databricks is where most Bronze → Silver → Gold transformations happen.
Cluster types
All-purpose cluster
Used for interactive development in notebooks. Avoid it for production jobs: all-purpose compute is billed at a higher rate than job compute and keeps running while idle unless auto-termination is set.
- Use for exploration, development and debugging
- Set auto-termination to 30-60 minutes
- Share clusters across the team to reduce cost
Job cluster
Created for a single job run and terminated when the run finishes. This is the right default for production.
{
  "new_cluster": {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {
      "min_workers": 2,
      "max_workers": 8
    }
  }
}
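A cluster spec takes either a fixed `num_workers` or an `autoscale` range, so it helps to build the spec in one place and reject ambiguous combinations. The following is a minimal sketch in plain Python; `build_job_cluster` is a hypothetical helper, not part of any Databricks SDK, and it only produces the dictionary you would place under `new_cluster`.

```python
from typing import Optional


def build_job_cluster(
    node_type_id: str,
    spark_version: str = "14.3.x-scala2.12",
    num_workers: Optional[int] = None,
    min_workers: Optional[int] = None,
    max_workers: Optional[int] = None,
) -> dict:
    """Return a `new_cluster` dict with fixed OR autoscaled sizing (hypothetical helper)."""
    fixed = num_workers is not None
    scaled = min_workers is not None or max_workers is not None
    if fixed == scaled:
        raise ValueError("Set num_workers or min/max workers, not both and not neither")

    cluster = {"spark_version": spark_version, "node_type_id": node_type_id}
    if fixed:
        cluster["num_workers"] = num_workers
    else:
        if min_workers is None or max_workers is None or min_workers > max_workers:
            raise ValueError("Autoscale needs both bounds with min_workers <= max_workers")
        cluster["autoscale"] = {"min_workers": min_workers, "max_workers": max_workers}
    return cluster


# Same values as the JSON above: DS3_v2 nodes, autoscaling between 2 and 8 workers.
spec = build_job_cluster("Standard_DS3_v2", min_workers=2, max_workers=8)
```

Keeping the validation in a helper like this means every job definition in the repository is sized the same way, and a mistake fails at build time rather than at cluster launch.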
SQL warehouse
For analytical SQL queries and BI via Databricks SQL. Decoupled from Spark clusters, it's more efficient for purely SQL workloads.
Cluster sizing for production
Avoid the common mistake of overprovisioning. Start small and monitor:
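# Check executor usage from inside the job. PySpark's StatusTracker does not expose
# getExecutorInfos, so this goes through the JVM tracker (not available on Spark Connect).
for info in spark.sparkContext._jsc.sc().statusTracker().getExecutorInfos():
    print(info.host(), info.numRunningTasks())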
For typical transformation jobs:
- Small (up to 50GB): Standard_DS3_v2 with 2-4 workers
- Medium (50-500GB): Standard_DS4_v2 with 4-8 workers
- Large (500GB+): consider Standard_DS5_v2 with autoscaling
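The tiers above can be encoded as a small lookup so that job definitions start from the same baseline. This is only a sketch: `suggest_cluster` is a hypothetical helper, the thresholds are the ones listed above, and the Large tier leaves the worker bounds unset because the list does not give any.

```python
def suggest_cluster(data_gb: float) -> dict:
    """Map an approximate input volume in GB to a starting cluster shape (hypothetical helper)."""
    if data_gb < 0:
        raise ValueError("data_gb must be non-negative")
    if data_gb <= 50:
        return {"node_type_id": "Standard_DS3_v2", "min_workers": 2, "max_workers": 4}
    if data_gb <= 500:
        return {"node_type_id": "Standard_DS4_v2", "min_workers": 4, "max_workers": 8}
    # Large tier: autoscaling is recommended, but the bounds are workload-specific.
    return {"node_type_id": "Standard_DS5_v2", "min_workers": None, "max_workers": None}


starting_point = suggest_cluster(120)
```

Treat the result as a starting point to monitor and adjust, not as a sizing rule.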
Essential Spark config
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "200") # adjust by volume
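One way to "adjust by volume" is to derive the partition count from the expected shuffle size. The sketch below assumes a target of roughly 128 MiB per shuffle partition, which is a common rule of thumb rather than a Databricks recommendation; `estimate_shuffle_partitions` and its parameters are hypothetical.

```python
import math


def estimate_shuffle_partitions(
    shuffle_bytes: int,
    target_partition_mib: int = 128,  # assumption: rule-of-thumb partition size
    min_partitions: int = 1,
) -> int:
    """Estimate spark.sql.shuffle.partitions from the expected shuffle volume."""
    if shuffle_bytes < 0 or target_partition_mib <= 0:
        raise ValueError("shuffle_bytes must be >= 0 and target_partition_mib > 0")
    target_bytes = target_partition_mib * 1024 * 1024
    return max(min_partitions, math.ceil(shuffle_bytes / target_bytes))


# 25 GiB of shuffle data at 128 MiB per partition gives 200 partitions.
partitions = estimate_shuffle_partitions(25 * 1024**3)
```

In a notebook the result would be passed as a string, for example `spark.conf.set("spark.sql.shuffle.partitions", str(partitions))`.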
Notebooks in production
Recommended structure
notebooks/
  ingestion/
    bronze_orders.py
    bronze_customers.py
  transformation/
    silver_orders.py
    gold_revenue.py
  utils/
    common_functions.py
    logging.py
%run vs dbutils.notebook.run
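Use %run to include a utilities notebook in the current notebook's context, so its functions and variables are available directly and no separate run is started (more efficient):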
%run ../utils/common_functions
Use dbutils.notebook.run to execute a notebook as a separate run with its own parameters and timeout:
result = dbutils.notebook.run(
    "./silver_orders",
    timeout_seconds=3600,
    arguments={"date": "2026-04-01", "env": "prod"}
)
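The value returned by `dbutils.notebook.run` is the string the child notebook passes to `dbutils.notebook.exit`, so structured results are usually serialized as JSON. A minimal sketch of that round trip follows; `encode_result`, `decode_result`, and the `status` / `rows_written` fields are hypothetical names chosen for illustration.

```python
import json


def encode_result(status: str, rows_written: int) -> str:
    """Child side: build the string to hand to dbutils.notebook.exit(...)."""
    return json.dumps({"status": status, "rows_written": rows_written})


def decode_result(raw: str) -> dict:
    """Parent side: parse the string returned by dbutils.notebook.run(...)."""
    payload = json.loads(raw)
    if payload.get("status") != "ok":
        raise RuntimeError(f"Child notebook reported a failure: {payload}")
    return payload


# Simulated round trip; in Databricks the string would cross the notebook boundary.
raw = encode_result("ok", 1200)
summary = decode_result(raw)
```

In the child notebook this would end with `dbutils.notebook.exit(encode_result("ok", 1200))`, and the parent would call `decode_result(result)` on the value shown above.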
Widgets for parameterization
dbutils.widgets.text("execution_date", "", "Execution date")
dbutils.widgets.dropdown("env", "dev", ["dev", "staging", "prod"])
execution_date = dbutils.widgets.get("execution_date")
env = dbutils.widgets.get("env")
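Widget values always arrive as strings, so it is worth validating them before they reach any table name or filter. The sketch below assumes an ISO `YYYY-MM-DD` date format and treats an empty `execution_date` (the widget default above) as "today"; both are assumptions, and `parse_run_params` is a hypothetical helper.

```python
from datetime import date, datetime
from typing import Optional, Tuple

ALLOWED_ENVS = ("dev", "staging", "prod")  # mirrors the dropdown choices


def parse_run_params(
    execution_date: str, env: str, today: Optional[date] = None
) -> Tuple[date, str]:
    """Validate raw widget strings and return typed values (hypothetical helper)."""
    if env not in ALLOWED_ENVS:
        raise ValueError(f"Unknown env {env!r}; expected one of {ALLOWED_ENVS}")
    if execution_date.strip() == "":
        # Assumption: an empty widget means "run for today".
        return (today or date.today()), env
    # strptime raises ValueError on anything that is not YYYY-MM-DD.
    return datetime.strptime(execution_date.strip(), "%Y-%m-%d").date(), env


run_date, run_env = parse_run_params("2026-04-01", "prod")
```

In the notebook, the two arguments would be the values read with `dbutils.widgets.get`.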
Databricks Workflows (Jobs)
Creating a multi-task job
{
  "name": "pipeline_silver_daily",
  "tasks": [
    {
      "task_key": "ingest_bronze",
      "notebook_task": { "notebook_path": "/ingestion/bronze_orders" },
      "new_cluster": { ... }
    },
    {
      "task_key": "transform_silver",
      "depends_on": [{ "task_key": "ingest_bronze" }],
      "notebook_task": { "notebook_path": "/transformation/silver_orders" },
      "new_cluster": { ... }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "America/Sao_Paulo"
  }
}
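As the number of tasks grows, a typo in a `depends_on` key or an accidental cycle is easy to introduce. The following sketch checks a job definition locally before it is submitted; it uses only the standard library, and `task_order` is a hypothetical helper that relies on the `tasks`, `task_key`, and `depends_on` fields shown above.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def task_order(job: dict) -> list:
    """Return task keys in a valid execution order, or raise on bad dependencies."""
    tasks = job["tasks"]
    keys = [t["task_key"] for t in tasks]
    if len(keys) != len(set(keys)):
        raise ValueError("Duplicate task_key in job definition")

    graph = {}
    for task in tasks:
        deps = [d["task_key"] for d in task.get("depends_on", [])]
        unknown = set(deps) - set(keys)
        if unknown:
            raise ValueError(f"{task['task_key']} depends on unknown tasks: {sorted(unknown)}")
        graph[task["task_key"]] = deps

    # static_order raises graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(graph).static_order())


job = {
    "name": "pipeline_silver_daily",
    "tasks": [
        {"task_key": "ingest_bronze"},
        {"task_key": "transform_silver", "depends_on": [{"task_key": "ingest_bronze"}]},
    ],
}
order = task_order(job)
```

A check like this fits well in a CI step that runs before the job definition is deployed.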
Retry and alerts
{
  "max_retries": 2,
  "min_retry_interval_millis": 300000,
  "email_notifications": {
    "on_failure": ["data-team@company.com"],
    "on_success": []
  }
}
Secrets with Databricks Secret Scope
Never store credentials in notebooks. Use Secret Scopes linked to Azure Key Vault:
storage_account_key = dbutils.secrets.get(
    scope="kv-lakehouse-prod",
    key="adls-storage-account-key"
)
spark.conf.set(
    "fs.azure.account.key.datalakeprod.dfs.core.windows.net",
    storage_account_key
)
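The configuration key embeds the storage account name, so hardcoding it ties the notebook to one environment. A small sketch of building the key from a parameter instead; `adls_account_key_conf` is a hypothetical helper, and the validation reflects Azure's rule that storage account names are 3 to 24 lowercase letters and digits.

```python
import re


def adls_account_key_conf(storage_account: str) -> str:
    """Build the Spark conf key used to set an ADLS Gen2 account key."""
    if not re.fullmatch(r"[a-z0-9]{3,24}", storage_account):
        raise ValueError(f"Invalid storage account name: {storage_account!r}")
    return f"fs.azure.account.key.{storage_account}.dfs.core.windows.net"


conf_key = adls_account_key_conf("datalakeprod")
```

The returned key would replace the string literal in the `spark.conf.set` call, with the secret value still coming from `dbutils.secrets.get`.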
Cost optimization
- Spot instances for interruption-tolerant jobs (60-80% savings)
- Auto-terminate is mandatory on all-purpose clusters
- Photon only where there's measurable gain on analytical SQL queries
- Job clusters always in production, never all-purpose
- Monitor with Cost Analysis in Azure and the Databricks Account Console
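On Azure, spot capacity for a job cluster is requested through the `azure_attributes` block of the cluster spec. The sketch below shows one plausible shape, to the best of my understanding of the Clusters API; verify the field names and values against the current documentation before relying on them. `with_spot_workers` is a hypothetical helper.

```python
import copy


def with_spot_workers(cluster: dict, first_on_demand: int = 1) -> dict:
    """Return a copy of a cluster spec that requests spot VMs with on-demand fallback."""
    if first_on_demand < 1:
        raise ValueError("Keep at least one on-demand node so the driver is not evicted")
    spec = copy.deepcopy(cluster)  # do not mutate the caller's spec
    spec["azure_attributes"] = {
        # Use spot VMs, falling back to on-demand when spot capacity is unavailable.
        "availability": "SPOT_WITH_FALLBACK_AZURE",
        # The first N nodes (driver first) stay on on-demand instances.
        "first_on_demand": first_on_demand,
        # -1 means: never evict on price, pay up to the on-demand rate.
        "spot_bid_max_price": -1,
    }
    return spec


base = {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 2, "max_workers": 8},
}
spot_cluster = with_spot_workers(base)
```

Keeping the driver on an on-demand node limits the impact of an eviction to lost workers, which Spark can usually recover from by recomputing tasks.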
Unity Catalog integration
# Always use three-part naming with Unity Catalog
df = spark.read.table("catalog_prod.silver.orders")
df.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("catalog_prod.gold.revenue_daily")
Avoid absolute ADLS paths when Unity Catalog is configured. Always use the catalog namespace.
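To keep three-part names consistent across notebooks, they can be assembled by one helper rather than typed inline. This is a sketch only: `table_name` is hypothetical, and the identifier check is a deliberately conservative assumption (letters, digits, underscores), not the full set of names Unity Catalog accepts.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def table_name(catalog: str, schema: str, table: str) -> str:
    """Build a catalog.schema.table name, rejecting parts that would need quoting."""
    parts = (catalog, schema, table)
    for part in parts:
        if not _IDENTIFIER.fullmatch(part):
            raise ValueError(f"Unsupported identifier: {part!r}")
    return ".".join(parts)


source = table_name("catalog_prod", "silver", "orders")
target = table_name("catalog_prod", "gold", "revenue_daily")
```

The results would be passed to `spark.read.table(source)` and `.saveAsTable(target)`, which also makes it straightforward to switch the catalog per environment.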