Architecture · 2026-03-29 · 5 min read

Delta Live Tables: declarative pipelines that actually simplify Lakehouse engineering

How DLT changes the way you think about pipeline development, data quality, and orchestration in Databricks.

Most Lakehouse pipelines are a collection of notebooks with manual dependency management, custom retry logic, watermarks stored in control tables, and data quality checks scattered across the codebase. Delta Live Tables (DLT) replaces that with a single declarative framework where you define what your data should look like, and Databricks handles the how.

This isn't a marketing pitch for DLT. This is a practical evaluation of where it saves you work and where it introduces constraints you need to plan around.

What DLT actually is

DLT is a framework within Databricks where you define pipeline logic as a series of table declarations using Python or SQL. You don't write execution logic — no explicit reads, no explicit writes, no orchestration. You declare the expected output and the transformation to produce it.

import dlt
from pyspark.sql import functions as F

@dlt.table(
    name="customers_bronze",
    comment="Raw customer data from CRM ingestion"
)
def customers_bronze():
    return (
        spark.readStream
             .format("cloudFiles")
             .option("cloudFiles.format", "json")
             .load("abfss://landing@storage.dfs.core.windows.net/customers/")
    )

@dlt.table(
    name="customers_silver",
    comment="Cleaned and deduplicated customer records"
)
@dlt.expect_or_drop("valid_email", "email IS NOT NULL AND email LIKE '%@%'")
@dlt.expect("valid_status", "status IN ('active', 'inactive', 'pending')")
def customers_silver():
    return (
        dlt.read_stream("customers_bronze")
           .withColumn("email", F.lower(F.trim(F.col("email"))))
           .dropDuplicates(["customer_id"])
    )

Two tables. No watermark management. No retry logic. No explicit write calls. DLT handles incremental processing, checkpointing, lineage, and execution order automatically.
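The execution order falls out of the declarations themselves: because customers_silver reads from customers_bronze, DLT can build a dependency graph and run tables in topological order. A minimal Python sketch of that idea, using the standard library's graphlib (the dict of dependencies is an illustration, not DLT's internal representation):

```python
from graphlib import TopologicalSorter

# Each table declares which tables it reads from; DLT infers this from
# dlt.read / dlt.read_stream calls. Here we model it as a plain dict
# mapping each table to its predecessors.
dependencies = {
    "customers_bronze": set(),                 # reads only from cloud storage
    "customers_silver": {"customers_bronze"},  # dlt.read_stream("customers_bronze")
}

# A topological order of the graph is a valid execution order for one
# pipeline update: every table runs after everything it reads from.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)  # bronze before silver
```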

Auto Loader integration

DLT's most practical feature for Lakehouse teams is native Auto Loader integration. Auto Loader (cloudFiles) is Databricks' incremental file ingestion mechanism — it tracks which files in an ADLS path have been processed and ingests only new arrivals.

@dlt.table(name="orders_bronze")
def orders_bronze():
    return (
        spark.readStream
             .format("cloudFiles")
             .option("cloudFiles.format", "parquet")
             .option("cloudFiles.schemaLocation", "/mnt/schema/orders")
             .option("cloudFiles.inferColumnTypes", "true")
             .load("abfss://landing@storage.dfs.core.windows.net/orders/")
    )

Auto Loader maintains a checkpoint in ADLS. If the pipeline fails midway, it resumes exactly where it left off — no duplicate processing, no missed files. This replaces the watermark + control table pattern for file-based ingestion.
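The checkpoint behavior can be pictured as a persisted set of already-ingested file names: each run lists the landing path, takes the set difference, and marks files done only after the batch commits. A toy illustration of the idea in plain Python (not Auto Loader's actual on-disk format; file names are made up):

```python
def discover_new_files(listed_files, checkpoint):
    """Return files not yet processed, in listing order.

    `listed_files` models a storage listing; `checkpoint` models the
    persisted set of already-ingested file names.
    """
    new_files = [f for f in listed_files if f not in checkpoint]
    checkpoint.update(new_files)  # persisted only after the batch commits
    return new_files

checkpoint = set()
print(discover_new_files(["orders_001.parquet", "orders_002.parquet"], checkpoint))

# A failed run never updates the checkpoint, so the same files are offered
# again on the next run — no missed files. A successful run skips them —
# no duplicate processing.
print(discover_new_files(["orders_001.parquet", "orders_002.parquet",
                          "orders_003.parquet"], checkpoint))
```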

Data quality with expectations

This is where DLT earns its place. Instead of scattered filter calls and manual violation logging, you declare quality rules as expectations directly on the table:

@dlt.table(name="payments_silver")
@dlt.expect_or_drop("positive_amount", "amount > 0")
@dlt.expect_or_fail("non_null_payment_id", "payment_id IS NOT NULL")
@dlt.expect("known_currency", "currency IN ('BRL', 'USD', 'EUR')")
def payments_silver():
    return dlt.read_stream("payments_bronze")

Three behaviors are available:

| Decorator | On violation |
|---|---|
| @dlt.expect | Log the violation, keep the row |
| @dlt.expect_or_drop | Drop the violating row, log it |
| @dlt.expect_or_fail | Fail the entire pipeline update |

Violation metrics are tracked automatically in the DLT event log and visible in the pipeline UI. You get a time-series of violation rates per rule, per table — without building any of that infrastructure yourself.
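The three behaviors reduce to a simple per-row decision: evaluate each predicate, then keep, drop, or abort. A plain-Python sketch of that semantics on sample rows (the lambda predicates stand in for the SQL expressions above; this is the idea, not DLT's implementation):

```python
def apply_expectations(rows, expectations):
    """Apply DLT-style expectations to rows.

    `expectations` is a list of (name, predicate, action) where action is
    'warn' (keep + log), 'drop', or 'fail'. Returns (kept_rows, violations).
    """
    kept, violations = [], {}
    for row in rows:
        keep = True
        for name, predicate, action in expectations:
            if not predicate(row):
                violations[name] = violations.get(name, 0) + 1
                if action == "fail":
                    raise RuntimeError(f"expectation {name} violated, aborting update")
                if action == "drop":
                    keep = False
        if keep:
            kept.append(row)
    return kept, violations

rows = [
    {"payment_id": 1, "amount": 50.0, "currency": "BRL"},
    {"payment_id": 2, "amount": -10.0, "currency": "USD"},  # dropped
    {"payment_id": 3, "amount": 20.0, "currency": "GBP"},   # kept, but logged
]
expectations = [
    ("positive_amount", lambda r: r["amount"] > 0, "drop"),
    ("non_null_payment_id", lambda r: r["payment_id"] is not None, "fail"),
    ("known_currency", lambda r: r["currency"] in ("BRL", "USD", "EUR"), "warn"),
]
kept, violations = apply_expectations(rows, expectations)
print(len(kept), violations)  # 2 rows kept; one violation counted per rule
```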

SCD Type 2 with APPLY CHANGES

One of the most complex patterns in Lakehouse engineering — Slowly Changing Dimensions Type 2 — becomes declarative in DLT:

dlt.create_streaming_table("customers_scd2")

dlt.apply_changes(
    target="customers_scd2",
    source="customers_bronze",
    keys=["customer_id"],
    sequence_by=F.col("updated_at"),
    stored_as_scd_type=2,
    track_history_column_list=["email", "status", "address"]
)

DLT manages the __START_AT and __END_AT columns. You query active records with WHERE __END_AT IS NULL. Historical queries filter on the __START_AT/__END_AT range. No custom MERGE logic, no manual SCD management.
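The mechanics behind apply_changes can be pictured as: close the currently-open version of a key by stamping its end timestamp, then append a new open version. A simplified in-memory sketch of that pattern (real DLT also handles out-of-order events via sequence_by, deletes, and schema tracking; column values here are illustrative):

```python
def apply_change(history, change):
    """Apply one CDC event to an SCD Type 2 history list.

    Each history entry carries the business columns plus __START_AT and
    __END_AT, where __END_AT of None marks the current version of a key.
    """
    key, ts = change["customer_id"], change["updated_at"]
    for rec in history:
        if rec["customer_id"] == key and rec["__END_AT"] is None:
            rec["__END_AT"] = ts  # close the open version
    history.append({**change, "__START_AT": ts, "__END_AT": None})

history = []
apply_change(history, {"customer_id": 1, "status": "active",   "updated_at": "2026-01-01"})
apply_change(history, {"customer_id": 1, "status": "inactive", "updated_at": "2026-02-01"})

# Equivalent of: SELECT * FROM customers_scd2 WHERE __END_AT IS NULL
current = [r for r in history if r["__END_AT"] is None]
print(current[0]["status"])  # only the latest version is open
```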

Streaming vs. triggered execution

DLT pipelines run in two modes.

Continuous mode runs the pipeline indefinitely, processing new data as it arrives. It's the right choice for near-real-time use cases where latency must stay under a few minutes, though keeping the cluster alive around the clock means higher cost.

Triggered mode runs once, processes all pending data, then shuts down. You pay only for compute time used. For most Lakehouse Bronze and Silver pipelines, triggered mode with a Databricks Workflow schedule is the right call. Reserve continuous mode for streaming Bronze ingestion from Kafka topics where latency actually matters.
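The cost trade-off is straightforward arithmetic. With hypothetical numbers (an all-in cluster cost of $10/hour, four triggered runs a day at 10 minutes each), the comparison looks like:

```python
HOURLY_CLUSTER_COST = 10.0  # hypothetical $/hour, for illustration only

# Continuous mode: the cluster stays up all day.
continuous_daily = 24 * HOURLY_CLUSTER_COST

# Triggered mode: pay only while a run is active.
runs_per_day, minutes_per_run = 4, 10
triggered_daily = runs_per_day * (minutes_per_run / 60) * HOURLY_CLUSTER_COST

print(f"continuous: ${continuous_daily:.2f}/day, triggered: ${triggered_daily:.2f}/day")
# Under these assumptions triggered mode is roughly 36x cheaper; the gap
# is what you pay for sub-minute latency in continuous mode.
```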

Where DLT has constraints

A DLT table function must return a Spark DataFrame, so transformations that don't fit that model — row-by-row calls to external APIs, pandas-only logic, non-Spark libraries — don't belong inside it. The execution context is managed by DLT. If your transformation requires something outside Spark, it needs to live outside DLT entirely.

Debugging is slower than a regular notebook. Because DLT manages execution, you can't run individual cells interactively — you iterate by deploying and waiting for the pipeline run. Development mode reduces this friction (smaller cluster, faster startup) but it's still a longer feedback loop than you're probably used to.

At low data volumes, the cost math may not work out. DLT pipelines run on DLT-managed compute, which is billed at a premium over regular job clusters. For large, complex pipelines the operational simplicity justifies it. For small pipelines, think twice.

When to use DLT

Use DLT when:

  • You have multiple dependent tables with lineage you want tracked automatically
  • Data quality rules need to be visible and measured over time
  • You're building new pipelines from scratch and want the operational simplicity
  • You need SCD Type 2 and don't want to maintain custom MERGE logic

Don't use DLT when:

  • You're migrating existing complex notebooks with non-Spark dependencies
  • Your pipeline logic requires arbitrary Python that can't run in Spark
  • You're on a tight cost budget and the premium doesn't justify the use case

The honest answer is that DLT shines brightest on greenfield pipelines. Retrofitting it onto existing pipelines is often more painful than it's worth.