Change Data Capture is the backbone of every serious Lakehouse. If you're still running nightly full loads from your source databases, you're paying ten times more than you should and introducing unnecessary data latency. This article covers the ADF patterns that I've actually run in production — not theoretical architectures, but the pipelines that are running right now.
CDC vs. full load: know when to use each
Full load is simple: truncate and reload. It's reliable, easy to debug, and perfectly fine for small dimension tables that change infrequently (under 100k rows, updated weekly). Use it there.
The moment you're dealing with tables that:
- Have millions of rows
- Change continuously during business hours
- Need to reflect updates within hours, not days
...you need CDC. The cost difference alone justifies the implementation complexity.
Strategy 1: Watermark-based CDC
This is the most practical CDC approach for SQL Server and Azure SQL Database sources. You rely on a column like updated_at or modified_datetime to identify changed rows.
The control table comes first. You need a watermark control table in a dedicated schema:
CREATE TABLE control.watermarks (
pipeline_name VARCHAR(200) PRIMARY KEY,
source_table VARCHAR(200),
watermark_col VARCHAR(100),
last_value DATETIME2,
last_run_at DATETIME2
);
INSERT INTO control.watermarks VALUES (
'customers_bronze', 'dbo.Customers', 'UpdatedAt',
'1900-01-01', GETDATE()
);
Your pipeline has three activities chained together:
- Lookup — reads the current watermark from the control table
- Copy Data — reads from source where updated_at > @{activity('GetWatermark').output.firstRow.last_value}
- Stored Procedure — updates the watermark after successful copy
{
"name": "Copy_Incremental",
"type": "Copy",
"inputs": [{ "referenceName": "DS_Source_Customers" }],
"outputs": [{ "referenceName": "DS_ADLS_Bronze" }],
"typeProperties": {
"source": {
"type": "AzureSqlSource",
"sqlReaderQuery": "SELECT * FROM dbo.Customers WHERE UpdatedAt > '@{activity('GetWatermark').output.firstRow.last_value}' AND UpdatedAt <= GETDATE()"
}
}
}
Note the upper bound <= GETDATE(). Capture this value once at the start of the run, use it as the query's upper bound, and write that same value back as the new watermark. If you instead stamp the watermark with a fresh GETDATE() after the copy finishes, any row updated while the copy was running falls between the two timestamps and is skipped forever.
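The window logic is worth seeing end to end. A minimal Python sketch, using an in-memory SQLite database as a stand-in for the source table and control.watermarks (table names, column names, and timestamps are illustrative): capture the upper bound once, query the window, then advance the watermark to that same captured value.

```python
import sqlite3

# In-memory stand-in for the source table and control.watermarks;
# all names and timestamps are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, updated_at TEXT);
    CREATE TABLE watermarks (pipeline_name TEXT PRIMARY KEY, last_value TEXT);
    INSERT INTO customers VALUES
        (1, '2024-01-01T10:00:00'),
        (2, '2024-01-02T11:00:00');
    INSERT INTO watermarks VALUES ('customers_bronze', '2024-01-01T12:00:00');
""")

def run_incremental_copy(conn, now_utc):
    # Lookup: read the current watermark (lower bound of the window).
    (low,) = conn.execute(
        "SELECT last_value FROM watermarks WHERE pipeline_name = 'customers_bronze'"
    ).fetchone()
    # Capture the upper bound ONCE, before the copy starts.
    high = now_utc
    # Copy: only rows inside the (low, high] window.
    rows = conn.execute(
        "SELECT id, updated_at FROM customers "
        "WHERE updated_at > ? AND updated_at <= ?",
        (low, high),
    ).fetchall()
    # Advance the watermark to the captured upper bound, never to a fresh "now".
    conn.execute(
        "UPDATE watermarks SET last_value = ? WHERE pipeline_name = 'customers_bronze'",
        (high,),
    )
    return rows

copied = run_incremental_copy(conn, "2024-01-03T00:00:00")
```

ISO-8601 strings sort chronologically, so the string comparison here behaves like the DATETIME2 comparison in the real query. Only the row updated after the stored watermark falls inside the window; the next run starts exactly where this one stopped.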
Strategy 2: Soft delete handling
Watermark captures inserts and updates. It does not capture deletes — unless your source uses soft deletes (an is_deleted flag). If rows disappear from the source without any updated_at change, you'll silently retain stale data in Bronze.
Three approaches, in order of preference:
Option A is source-side soft deletes, and it's the best. Work with the source system team to implement an is_deleted flag. The flag gets set to true and updated_at gets bumped when a record is "deleted". Your watermark query picks it up normally. In Silver, you filter is_deleted = false during transformation.
Option B is a full load for small tables with deletes. If the table has fewer than 500k rows and deletes are business-critical, just full-load it. The operational simplicity is worth it at that scale.
Option C is periodic reconciliation. Once per week, run a reconciliation job that compares Bronze row counts by partition against the source. Flag discrepancies for investigation. This doesn't prevent stale data, but it catches it.
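The reconciliation check in Option C reduces to comparing per-partition counts. A minimal sketch, with hypothetical partition keys and counts standing in for the real Bronze-vs-source queries:

```python
def reconcile(source_counts, bronze_counts):
    """Return partitions whose Bronze row count diverges from the source."""
    partitions = set(source_counts) | set(bronze_counts)
    return {
        p: (source_counts.get(p, 0), bronze_counts.get(p, 0))
        for p in sorted(partitions)
        if source_counts.get(p, 0) != bronze_counts.get(p, 0)
    }

# Bronze holds three extra rows for 2024-01-02: rows deleted at the source
# but retained in Bronze, exactly the failure mode watermark CDC misses.
discrepancies = reconcile(
    {"2024-01-01": 100, "2024-01-02": 95},
    {"2024-01-01": 100, "2024-01-02": 98},
)
```

Any non-empty result is a flag for investigation, not an automatic fix; the point is that the stale data stops being silent.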
ADF pipeline parameterization
Hard-coding table names is how pipelines become unmaintainable. The correct pattern is a metadata-driven pipeline: one generic pipeline that reads a config table and iterates.
{
"name": "Pipeline_CDC_Master",
"activities": [
{
"name": "Lookup_Tables",
"type": "Lookup",
"typeProperties": {
"source": {
"type": "AzureSqlSource",
"sqlReaderQuery": "SELECT * FROM control.cdc_config WHERE is_active = 1"
}
}
},
{
"name": "ForEach_Table",
"type": "ForEach",
"typeProperties": {
"items": "@activity('Lookup_Tables').output.value",
"activities": [
{
"name": "Execute_Table_Pipeline",
"type": "ExecutePipeline",
"typeProperties": {
"pipeline": { "referenceName": "Pipeline_CDC_Single_Table" },
"parameters": {
"source_table": "@item().source_table",
"target_path": "@item().target_path",
"watermark_col": "@item().watermark_col"
}
}
}
]
}
}
]
}
The control.cdc_config table is your single source of truth. Adding a new table to the pipeline is a one-row INSERT, not a pipeline modification.
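To make the metadata-driven flow concrete, here is a sketch of what the Lookup-plus-ForEach combination effectively computes. The config rows and watermark values are hypothetical; the keys mirror the parameters passed to Pipeline_CDC_Single_Table above.

```python
# Hypothetical rows from control.cdc_config.
cdc_config = [
    {"source_table": "dbo.Customers", "target_path": "bronze/customers",
     "watermark_col": "UpdatedAt", "is_active": 1},
    {"source_table": "dbo.Orders", "target_path": "bronze/orders",
     "watermark_col": "ModifiedAt", "is_active": 0},
]

def build_queries(config, low_watermarks, high):
    """One incremental query per active table, as the ForEach loop would issue."""
    queries = {}
    for row in config:
        if not row["is_active"]:
            continue  # mirrors WHERE is_active = 1 in the Lookup
        table, col = row["source_table"], row["watermark_col"]
        low = low_watermarks[table]
        queries[row["target_path"]] = (
            f"SELECT * FROM {table} WHERE {col} > '{low}' AND {col} <= '{high}'"
        )
    return queries

queries = build_queries(
    cdc_config,
    {"dbo.Customers": "2024-01-01T00:00:00"},
    "2024-01-02T00:00:00",
)
```

Onboarding a new table means adding one dict (one config row); the query-building logic never changes.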
Schema drift
Source schemas change without warning. A Copy Activity with an explicit column mapping will fail hard when a new column appears, so leave the mapping empty and let Copy map by name. Pair that with fault tolerance, which redirects rows that can't be converted instead of failing the whole run:
"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
"linkedServiceName": { "referenceName": "LS_ADLS" },
"path": "bronze/_schema_errors/@{pipeline().RunId}"
}
Incompatible rows land in a dedicated error path, not in the main Bronze table. You get a complete run instead of a total failure, and you have the rejected rows for investigation.
Retry and monitoring
ADF retries are configured at the activity level, not the pipeline level. Set them on your Copy activity:
"policy": {
"timeout": "0.01:00:00",
"retry": 3,
"retryIntervalInSeconds": 300,
"secureOutput": false
}
Three retries at a 5-minute interval handle most transient Azure storage and network issues. Note the timeout format is d.hh:mm:ss, so "0.01:00:00" is a one-hour timeout, not one minute.
For monitoring, configure pipeline failure alerts to send email via an ADF alert rule. Beyond that, log every pipeline run — source row count, copy duration, watermark before and after — to a Delta table in Bronze. This becomes your observability layer.
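The run log can be this simple. A sketch with an in-memory SQLite table standing in for the Delta table in Bronze; the schema and field names are illustrative, but they cover the fields named above:

```python
import sqlite3
from datetime import datetime, timezone

# SQLite stand-in for a Delta run-log table in Bronze; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pipeline_run_log (
        pipeline_name TEXT, run_id TEXT, row_count INTEGER,
        copy_seconds REAL, watermark_before TEXT, watermark_after TEXT,
        logged_at TEXT
    )
""")

def log_run(conn, pipeline_name, run_id, row_count, copy_seconds,
            watermark_before, watermark_after):
    # Record exactly the fields needed to answer "what did this run do?"
    conn.execute(
        "INSERT INTO pipeline_run_log VALUES (?, ?, ?, ?, ?, ?, ?)",
        (pipeline_name, run_id, row_count, copy_seconds,
         watermark_before, watermark_after,
         datetime.now(timezone.utc).isoformat()),
    )

log_run(conn, "customers_bronze", "run-001", 1542, 37.2,
        "2024-01-01T12:00:00", "2024-01-02T00:00:00")
```

Logging the watermark before and after each run is what makes the incidents below diagnosable: a gap or overlap between consecutive windows is visible in one query.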
Real incidents and what we learned
The timezone incident. Our source SQL Server was in UTC-3. Our watermark was stored in UTC. We were consistently missing the last 3 hours of changes per run. Fix: always store watermarks in UTC, always cast source timestamps to UTC explicitly in the query.
The concurrent run problem. Two pipeline triggers overlapped during a deployment window. Both read the same watermark, both wrote the same rows, both updated the watermark. Result: duplicate rows in Bronze and a silent data quality issue downstream. Fix: set the pipeline's concurrency property to 1 so overlapping runs queue instead of executing in parallel, or implement a database-level semaphore in the control table.
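The database-level semaphore is a compare-and-set on the control table. A sketch against an in-memory SQLite stand-in, with a hypothetical is_running column added to the watermark row; the same single-statement UPDATE works on SQL Server:

```python
import sqlite3

# Stand-in for the control table, extended with a hypothetical lock column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE watermarks (pipeline_name TEXT PRIMARY KEY, is_running INTEGER)"
)
conn.execute("INSERT INTO watermarks VALUES ('customers_bronze', 0)")

def try_acquire(conn, pipeline_name):
    # Atomic compare-and-set: the UPDATE only matches if no run holds the lock.
    cur = conn.execute(
        "UPDATE watermarks SET is_running = 1 "
        "WHERE pipeline_name = ? AND is_running = 0",
        (pipeline_name,),
    )
    return cur.rowcount == 1

def release(conn, pipeline_name):
    conn.execute(
        "UPDATE watermarks SET is_running = 0 WHERE pipeline_name = ?",
        (pipeline_name,),
    )

first = try_acquire(conn, "customers_bronze")   # acquires the lock
second = try_acquire(conn, "customers_bronze")  # overlapping run is rejected
release(conn, "customers_bronze")
```

Because the check and the set happen in a single UPDATE statement, two overlapping runs cannot both see is_running = 0; the loser skips the copy (or fails fast) instead of duplicating rows.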
Schema drift in production. A source team added an 8,000-character VARCHAR column without notice. The Copy activity failed because the Parquet writer for ADLS hit a column size limit. Fix: schema drift handling enabled and column type mapping rules defined explicitly in the dataset.
None of these are edge cases. They will happen to your pipeline too.