Change Data Capture is the backbone of every serious Lakehouse. If you're still running nightly full loads from your source databases, you're paying ten times more than you should and introducing unnecessary data latency. This article covers the ADF patterns that I've actually run in production — not theoretical architectures, but the pipelines that are running right now.
CDC vs. full load: know when to use each
Full load is simple: truncate and reload. It's reliable, easy to debug, and perfectly fine for small dimension tables that change infrequently (under 100k rows, updated weekly). Use it there.
The moment you're dealing with tables that:
- Have millions of rows
- Change continuously during business hours
- Need to reflect updates within hours, not days
...you need CDC. The cost difference alone justifies the implementation complexity.
Strategy 1: Watermark-based CDC
This is the most practical CDC approach for SQL Server and Azure SQL Database sources. You rely on a column like updated_at or modified_datetime to identify changed rows.
The control table comes first. You need a watermark control table in a dedicated schema:
CREATE TABLE control.watermarks (
pipeline_name VARCHAR(200) PRIMARY KEY,
source_table VARCHAR(200),
watermark_col VARCHAR(100),
last_value DATETIME2,
last_run_at DATETIME2
);
INSERT INTO control.watermarks VALUES (
'customers_bronze', 'dbo.Customers', 'UpdatedAt',
'1900-01-01', GETDATE()
);
Your pipeline has three activities chained together:
- Lookup — reads the current watermark from the control table
- Copy Data — reads from source where updated_at > @{activity('GetWatermark').output.firstRow.last_value}
- Stored Procedure — updates the watermark after successful copy
{
"name": "Copy_Incremental",
"type": "Copy",
"inputs": [{ "referenceName": "DS_Source_Customers" }],
"outputs": [{ "referenceName": "DS_ADLS_Bronze" }],
"typeProperties": {
"source": {
"type": "AzureSqlSource",
"sqlReaderQuery": "SELECT * FROM dbo.Customers WHERE UpdatedAt > '@{activity('GetWatermark').output.firstRow.last_value}' AND UpdatedAt <= GETDATE()"
}
}
}
Note the upper bound <= GETDATE(). Capture this value once at the start of the run, use it as the query's upper bound, and write that same value back as the new watermark. If you instead stamp the watermark with a fresh GETDATE() after the copy finishes, any row updated while the copy was running falls between the two timestamps and is skipped forever.
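The window logic is worth seeing end to end. A minimal Python sketch, using an in-memory SQLite database as a stand-in for the source table and control.watermarks (table names, column names, and timestamps are illustrative): capture the upper bound once, query the window, then advance the watermark to that same captured value.

```python
import sqlite3

# In-memory stand-in for the source table and control.watermarks;
# all names and timestamps are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, updated_at TEXT);
    CREATE TABLE watermarks (pipeline_name TEXT PRIMARY KEY, last_value TEXT);
    INSERT INTO customers VALUES
        (1, '2024-01-01T10:00:00'),
        (2, '2024-01-02T11:00:00');
    INSERT INTO watermarks VALUES ('customers_bronze', '2024-01-01T12:00:00');
""")

def run_incremental_copy(conn, now_utc):
    # Lookup: read the current watermark (lower bound of the window).
    (low,) = conn.execute(
        "SELECT last_value FROM watermarks WHERE pipeline_name = 'customers_bronze'"
    ).fetchone()
    # Capture the upper bound ONCE, before the copy starts.
    high = now_utc
    # Copy: only rows inside the (low, high] window.
    rows = conn.execute(
        "SELECT id, updated_at FROM customers "
        "WHERE updated_at > ? AND updated_at <= ?",
        (low, high),
    ).fetchall()
    # Advance the watermark to the captured upper bound, never to a fresh "now".
    conn.execute(
        "UPDATE watermarks SET last_value = ? WHERE pipeline_name = 'customers_bronze'",
        (high,),
    )
    return rows

copied = run_incremental_copy(conn, "2024-01-03T00:00:00")
```

ISO-8601 strings sort chronologically, so the string comparison here behaves like the DATETIME2 comparison in the real query. Only the row updated after the stored watermark falls inside the window; the next run starts exactly where this one stopped.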
Strategy 2: Soft delete handling
Watermark captures inserts and updates. It does not capture deletes — unless your source uses soft deletes (an is_deleted flag). If rows disappear from the source without any updated_at change, you'll silently retain stale data in Bronze.
Three approaches, in order of preference:
Option A is source-side soft deletes, and it's the best. Work with the source system team to implement an is_deleted flag. The flag gets set to true and updated_at gets bumped when a record is "deleted". Your watermark query picks it up normally. In Silver, you filter is_deleted = false during transformation.
Option B is a full load for small tables with deletes. If the table has fewer than 500k rows and deletes are business-critical, just full-load it. The operational simplicity is worth it at that scale.
Option C is periodic reconciliation. Once per week, run a reconciliation job that compares Bronze row counts by partition against the source. Flag discrepancies for investigation. This doesn't prevent stale data, but it catches it.
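The reconciliation check in Option C reduces to comparing per-partition counts. A minimal sketch, with hypothetical partition keys and counts standing in for the real Bronze-vs-source queries:

```python
def reconcile(source_counts, bronze_counts):
    """Return partitions whose Bronze row count diverges from the source."""
    partitions = set(source_counts) | set(bronze_counts)
    return {
        p: (source_counts.get(p, 0), bronze_counts.get(p, 0))
        for p in sorted(partitions)
        if source_counts.get(p, 0) != bronze_counts.get(p, 0)
    }

# Bronze holds three extra rows for 2024-01-02: rows deleted at the source
# but retained in Bronze, exactly the failure mode watermark CDC misses.
discrepancies = reconcile(
    {"2024-01-01": 100, "2024-01-02": 95},
    {"2024-01-01": 100, "2024-01-02": 98},
)
```

Any non-empty result is a flag for investigation, not an automatic fix; the point is that the stale data stops being silent.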
ADF pipeline parameterization
Hard-coding table names is how pipelines become unmaintainable. The correct pattern is a metadata-driven pipeline: one generic pipeline that reads a config table and iterates.
{
"name": "Pipeline_CDC_Master",
"activities": [
{
"name": "Lookup_Tables",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "AzureSqlSource",
"sqlReaderQuery": "SELECT * FROM control.cdc_config WHERE is_active = 1"
}
}
},
{
"name": "ForEach_Table",
"type": "ForEach",
"typeProperties": {
"items": "@activity('Lookup_Tables').output.value",
"activities": [
{
"name": "Execute_Table_Pipeline",
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": { "referenceName": "Pipeline_CDC_Single_Table" },
"parameters": {
"source_table": "@item().source_table",
"target_path": "@item().target_path",
"watermark_col": "@item().watermark_col"
}
}
}
]
}
}
]
}
The control.cdc_config table is your single source of truth. Adding a new table to the pipeline is a one-row INSERT, not a pipeline modification.
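To make the metadata-driven flow concrete, here is a sketch of what the Lookup-plus-ForEach combination effectively computes. The config rows and watermark values are hypothetical; the keys mirror the parameters passed to Pipeline_CDC_Single_Table above.

```python
# Hypothetical rows from control.cdc_config.
cdc_config = [
    {"source_table": "dbo.Customers", "target_path": "bronze/customers",
     "watermark_col": "UpdatedAt", "is_active": 1},
    {"source_table": "dbo.Orders", "target_path": "bronze/orders",
     "watermark_col": "ModifiedAt", "is_active": 0},
]

def build_queries(config, low_watermarks, high):
    """One incremental query per active table, as the ForEach loop would issue."""
    queries = {}
    for row in config:
        if not row["is_active"]:
            continue  # mirrors WHERE is_active = 1 in the Lookup
        table, col = row["source_table"], row["watermark_col"]
        low = low_watermarks[table]
        queries[row["target_path"]] = (
            f"SELECT * FROM {table} WHERE {col} > '{low}' AND {col} <= '{high}'"
        )
    return queries

queries = build_queries(
    cdc_config,
    {"dbo.Customers": "2024-01-01T00:00:00"},
    "2024-01-02T00:00:00",
)
```

Onboarding a new table means adding one dict (one config row); the query-building logic never changes.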
Schema drift
Source schemas change without warning. A Copy Activity with an explicit column mapping will fail hard when a new column appears, so leave the mapping empty and let Copy map by name. Pair that with fault tolerance, which redirects rows that can't be converted instead of failing the whole run:
"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
"linkedServiceName": { "referenceName": "LS_ADLS" },
"path": "bronze/_schema_errors/@{pipeline().RunId}"
}
Incompatible rows land in a dedicated error path, not in the main Bronze table. You get a complete run instead of a total failure, and you have the rejected rows for investigation.
Retry and monitoring
ADF retries are configured at the activity level, not the pipeline level. Set them on your Copy activity:
"policy": {
"timeout": "0.01:00:00",
"retry": 3,
"retryIntervalInSeconds": 300,
"secureOutput": false
}
Three retries at a 5-minute interval handle most transient Azure storage and network issues. Note the timeout format is d.hh:mm:ss, so "0.01:00:00" is a one-hour timeout, not one minute.
For monitoring, configure pipeline failure alerts to send email via an ADF alert rule. Beyond that, log every pipeline run — source row count, copy duration, watermark before and after — to a Delta table in Bronze. This becomes your observability layer.
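The run log can be this simple. A sketch with an in-memory SQLite table standing in for the Delta table in Bronze; the schema and field names are illustrative, but they cover the fields named above:

```python
import sqlite3
from datetime import datetime, timezone

# SQLite stand-in for a Delta run-log table in Bronze; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pipeline_run_log (
        pipeline_name TEXT, run_id TEXT, row_count INTEGER,
        copy_seconds REAL, watermark_before TEXT, watermark_after TEXT,
        logged_at TEXT
    )
""")

def log_run(conn, pipeline_name, run_id, row_count, copy_seconds,
            watermark_before, watermark_after):
    # Record exactly the fields needed to answer "what did this run do?"
    conn.execute(
        "INSERT INTO pipeline_run_log VALUES (?, ?, ?, ?, ?, ?, ?)",
        (pipeline_name, run_id, row_count, copy_seconds,
         watermark_before, watermark_after,
         datetime.now(timezone.utc).isoformat()),
    )

log_run(conn, "customers_bronze", "run-001", 1542, 37.2,
        "2024-01-01T12:00:00", "2024-01-02T00:00:00")
```

Logging the watermark before and after each run is what makes the incidents below diagnosable: a gap or overlap between consecutive windows is visible in one query.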
Real incidents and what we learned
The timezone incident. Our source SQL Server was in UTC-3. Our watermark was stored in UTC. We were consistently missing the last 3 hours of changes per run. Fix: always store watermarks in UTC, always cast source timestamps to UTC explicitly in the query.
The concurrent run problem. Two pipeline triggers overlapped during a deployment window. Both read the same watermark, both wrote the same rows, both updated the watermark. Result: duplicate rows in Bronze and a silent data quality issue downstream. Fix: set the pipeline's concurrency property to 1 so overlapping runs queue instead of executing in parallel, or implement a database-level semaphore in the control table.
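The database-level semaphore is a compare-and-set on the control table. A sketch against an in-memory SQLite stand-in, with a hypothetical is_running column added to the watermark row; the same single-statement UPDATE works on SQL Server:

```python
import sqlite3

# Stand-in for the control table, extended with a hypothetical lock column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE watermarks (pipeline_name TEXT PRIMARY KEY, is_running INTEGER)"
)
conn.execute("INSERT INTO watermarks VALUES ('customers_bronze', 0)")

def try_acquire(conn, pipeline_name):
    # Atomic compare-and-set: the UPDATE only matches if no run holds the lock.
    cur = conn.execute(
        "UPDATE watermarks SET is_running = 1 "
        "WHERE pipeline_name = ? AND is_running = 0",
        (pipeline_name,),
    )
    return cur.rowcount == 1

def release(conn, pipeline_name):
    conn.execute(
        "UPDATE watermarks SET is_running = 0 WHERE pipeline_name = ?",
        (pipeline_name,),
    )

first = try_acquire(conn, "customers_bronze")   # acquires the lock
second = try_acquire(conn, "customers_bronze")  # overlapping run is rejected
release(conn, "customers_bronze")
```

Because the check and the set happen in a single UPDATE statement, two overlapping runs cannot both see is_running = 0; the loser skips the copy (or fails fast) instead of duplicating rows.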
Schema drift in production. A source team added an 8,000-character VARCHAR column without notice. The Copy activity failed because the Parquet writer for ADLS hit a column size limit. Fix: schema drift handling enabled and column type mapping rules defined explicitly in the dataset.
None of these are edge cases. They will happen to your pipeline too.