Why Databricks is central to the modern Lakehouse
Databricks combines Apache Spark with Delta Lake in a managed platform, eliminating the operational complexity of running your own Spark cluster. For data engineers working with Azure Lakehouses, Databricks is where most Bronze → Silver → Gold transformations happen.
Cluster types
All-purpose cluster
Used for interactive development in notebooks. Never use it in production: it's expensive and keeps billing even when idle unless auto-termination kicks in.
- Use for exploration, development and debugging
- Set auto-termination to 30-60 minutes
- Share clusters across the team to reduce cost
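Auto-termination is configured on the cluster itself. A minimal all-purpose cluster spec (field names follow the Databricks Clusters API; the name and sizes are illustrative):

```json
{
  "cluster_name": "dev-shared",
  "spark_version": "14.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 2,
  "autotermination_minutes": 45
}
```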
Job cluster
Created specifically for a job run and destroyed when it finishes. This is the right default for production.
Note that num_workers and autoscale are mutually exclusive in the cluster spec; with autoscaling, omit num_workers:

{
  "new_cluster": {
    "spark_version": "14.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {
      "min_workers": 2,
      "max_workers": 8
    }
  }
}
SQL warehouse
For analytical SQL queries and BI via Databricks SQL. It is decoupled from general-purpose Spark clusters and more efficient for purely SQL workloads.
Cluster sizing for production
Avoid the common mistake of overprovisioning. Start small and monitor:
# Check resource usage in the job. PySpark's StatusTracker does not expose
# executor info directly, so go through the JVM gateway to the Scala API:
infos = spark.sparkContext._jsc.sc().statusTracker().getExecutorInfos()
For typical transformation jobs:
- Small (up to 50GB): Standard_DS3_v2 with 2-4 workers
- Medium (50-500GB): Standard_DS4_v2 with 4-8 workers
- Large (500GB+): consider Standard_DS5_v2 with autoscaling
Essential Spark config
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "200") # adjust by volume
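The 200-partition default rarely matches real volumes. One rough heuristic sizes shuffle partitions from estimated input bytes; the 128 MB per-partition target below is a common rule of thumb, not a Databricks recommendation:

```python
def shuffle_partitions(input_bytes: int,
                       target_partition_bytes: int = 128 * 1024**2) -> int:
    """Estimate spark.sql.shuffle.partitions from input size.

    Assumes roughly 128 MB per shuffle partition; tune for your workload.
    """
    return max(1, -(-input_bytes // target_partition_bytes))  # ceiling division

# e.g. for ~50 GB of input:
# spark.conf.set("spark.sql.shuffle.partitions", str(shuffle_partitions(50 * 1024**3)))
```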
Notebooks in production
Recommended structure
notebooks/
  ingestion/
    bronze_orders.py
    bronze_customers.py
  transformation/
    silver_orders.py
    gold_revenue.py
  utils/
    common_functions.py
    logging.py
%run vs dbutils.notebook.run
Use %run to import utilities into the current notebook's context; it executes on the same cluster and shares variables and functions (more efficient):
%run ../utils/common_functions
Use dbutils.notebook.run to execute notebooks as subprocesses with parameters:
result = dbutils.notebook.run(
    "./silver_orders",                       # notebook path
    3600,                                    # timeout in seconds
    {"date": "2026-04-01", "env": "prod"}    # parameters, read via widgets
)
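Job-level retries usually suffice, but when orchestrating child notebooks from a driver notebook, a retry loop around the call is a common pattern. A sketch: run_notebook is any zero-argument callable wrapping the dbutils call, not a Databricks API:

```python
import time

def run_with_retry(run_notebook, max_retries: int = 2, backoff_seconds: float = 5.0):
    """Call run_notebook() and retry on failure with linear backoff.

    run_notebook is a zero-argument callable, e.g.
    lambda: dbutils.notebook.run("./silver_orders", 3600, {"env": "prod"}).
    """
    for attempt in range(max_retries + 1):
        try:
            return run_notebook()
        except Exception:
            if attempt == max_retries:
                raise  # exhausted retries; surface the last error
            time.sleep(backoff_seconds * (attempt + 1))
```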
Widgets for parameterization
dbutils.widgets.text("execution_date", "", "Execution date")
dbutils.widgets.dropdown("env", "dev", ["dev", "staging", "prod"])
execution_date = dbutils.widgets.get("execution_date")
env = dbutils.widgets.get("env")
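Widget values arrive as strings and are easy to mistype on manual runs, so validate them right after reading. A sketch; validate_params is a hypothetical helper, not a Databricks API:

```python
from datetime import datetime

VALID_ENVS = {"dev", "staging", "prod"}

def validate_params(execution_date: str, env: str) -> None:
    """Fail fast on bad widget values instead of deep inside the pipeline."""
    datetime.strptime(execution_date, "%Y-%m-%d")  # raises ValueError if malformed
    if env not in VALID_ENVS:
        raise ValueError(f"unknown env: {env!r}")

# validate_params(execution_date, env)  # call right after reading the widgets
```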
Databricks Workflows (Jobs)
Creating a multi-task job
{
  "name": "pipeline_silver_daily",
  "tasks": [
    {
      "task_key": "ingest_bronze",
      "notebook_task": { "notebook_path": "/ingestion/bronze_orders" },
      "new_cluster": { ... }
    },
    {
      "task_key": "transform_silver",
      "depends_on": [{ "task_key": "ingest_bronze" }],
      "notebook_task": { "notebook_path": "/transformation/silver_orders" },
      "new_cluster": { ... }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 6 * * ?",
    "timezone_id": "America/Sao_Paulo"
  }
}
Retry and alerts
{
  "max_retries": 2,
  "min_retry_interval_millis": 300000,
  "email_notifications": {
    "on_failure": ["data-team@company.com"],
    "on_success": []
  }
}
Secrets with Databricks Secret Scope
Never store credentials in notebooks. Use Secret Scopes linked to Azure Key Vault:
storage_account_key = dbutils.secrets.get(
    scope="kv-lakehouse-prod",
    key="adls-storage-account-key"
)

spark.conf.set(
    "fs.azure.account.key.datalakeprod.dfs.core.windows.net",
    storage_account_key
)
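When a direct path is still needed (for example, pre-Unity Catalog mounts or external locations), a small helper keeps abfss URIs consistent. A sketch; the container and account names are illustrative:

```python
def abfss_uri(container: str, storage_account: str, path: str = "") -> str:
    """Build an abfss:// URI for an ADLS Gen2 container."""
    return (f"abfss://{container}@{storage_account}"
            f".dfs.core.windows.net/{path.lstrip('/')}")

# abfss_uri("bronze", "datalakeprod", "orders")
# -> "abfss://bronze@datalakeprod.dfs.core.windows.net/orders"
```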
Cost optimization
- Spot instances for interruption-tolerant jobs (60-80% savings)
- Auto-terminate mandatory on all-purpose clusters
- Photon only where there's measurable gain on analytical SQL queries
- Job clusters always in production, never all-purpose
- Monitor with Cost Analysis in Azure + Databricks Account Console
Unity Catalog integration
# Always use three-part naming with Unity Catalog
df = spark.read.table("catalog_prod.silver.orders")

df.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("catalog_prod.gold.revenue_daily")
Avoid absolute ADLS paths when Unity Catalog is configured; always go through the catalog namespace.
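When catalog or schema names are assembled per environment, validating the three-part form early prevents writing to the wrong namespace. A hypothetical helper:

```python
def split_table_name(name: str) -> tuple[str, str, str]:
    """Split a Unity Catalog three-part name into (catalog, schema, table)."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {name!r}")
    return parts[0], parts[1], parts[2]
```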