ADF · 2026-04-01 · 4 min read

Azure Data Factory in practice: CDC ingestion pipelines from zero to prod

How to build robust ADF pipelines with a watermark strategy, parameterization, and error handling for ingestion at scale.

What is Azure Data Factory?

Azure Data Factory (ADF) is Microsoft's managed data integration service on Azure. It orchestrates data movement and transformation across 90+ native connectors — from SQL Server, Oracle and SAP to REST APIs, blobs and data lakes.

In a Lakehouse context, ADF is typically the entry point for data: it moves data from source systems into the Bronze layer of ADLS Gen2.

Core concepts

Linked Services

These are connections to external systems. Each source or destination needs a Linked Service with the correct credentials. Always use Azure Key Vault for secrets — never hardcode credentials in pipelines.

{
  "name": "ls_sqlserver_orders",
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "connectionString": "@Microsoft.KeyVault(SecretUri=...)"
  }
}

Datasets

Datasets represent data structure — tables, files or containers. Prefer parameterized datasets to avoid duplication:

{
  "name": "ds_sqlserver_table",
  "type": "AzureSqlTable",
  "parameters": {
    "tableName": { "type": "string" },
    "schemaName": { "type": "string" }
  },
  "typeProperties": {
    "schema": "@dataset().schemaName",
    "table": "@dataset().tableName"
  }
}

Activities

Activities are the units of work inside a pipeline: Copy Data, Lookup, ForEach, If Condition, Stored Procedure, and more.

CDC ingestion with watermark

The watermark strategy is the most common approach for incremental ingestion from tables that do not have native CDC enabled.

1. Control table

Create a table in Azure SQL to store the last processed watermark per table:

CREATE TABLE dbo.watermark_control (
    table_name    NVARCHAR(100) PRIMARY KEY,
    last_modified DATETIME2,
    updated_at    DATETIME2 DEFAULT GETDATE()
);

INSERT INTO dbo.watermark_control (table_name, last_modified, updated_at)
VALUES ('orders', '2020-01-01', GETDATE());

2. Pipeline structure

The incremental ingestion pipeline follows this flow:

Lookup (last watermark)
    ↓
Lookup (current max timestamp from source)
    ↓
Copy Data (WHERE modified_at > last_wm AND modified_at <= current_wm)
    ↓
Stored Procedure (updates watermark_control)
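
The two Lookup activities can be backed by queries like these (a sketch; `modified_at` is assumed to be the source's change-tracking column):

```sql
-- Lookup 1 (GetLastWatermark): read the stored watermark from the control table
SELECT last_modified
FROM dbo.watermark_control
WHERE table_name = 'orders';

-- Lookup 2 (GetCurrentWatermark): capture the current high-water mark at the source.
-- Fixing the upper bound here, before the copy starts, avoids losing rows
-- written while the copy is running.
SELECT MAX(modified_at) AS max_modified
FROM dbo.orders;
```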

3. Parameterized Copy Data activity

-- Dynamic query in Copy Data activity
SELECT *
FROM [@{dataset().schemaName}].[@{dataset().tableName}]
WHERE modified_at > '@{activity('GetLastWatermark').output.firstRow.last_modified}'
  AND modified_at <= '@{activity('GetCurrentWatermark').output.firstRow.max_modified}'

4. Update watermark

After the copy, run a Stored Procedure to record the new watermark:

CREATE PROCEDURE dbo.usp_update_watermark
    @table_name    NVARCHAR(100),
    @new_watermark DATETIME2
AS
UPDATE dbo.watermark_control
SET last_modified = @new_watermark,
    updated_at    = GETDATE()
WHERE table_name = @table_name;
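
Wired into the pipeline, the final activity can pass the captured max straight into the procedure. A minimal sketch (the activity name and the `ls_azuresql_control` linked service are illustrative):

```json
{
  "name": "UpdateWatermark",
  "type": "SqlServerStoredProcedure",
  "typeProperties": {
    "storedProcedureName": "dbo.usp_update_watermark",
    "storedProcedureParameters": {
      "table_name":    { "value": "orders", "type": "String" },
      "new_watermark": {
        "value": "@activity('GetCurrentWatermark').output.firstRow.max_modified",
        "type": "DateTime"
      }
    }
  },
  "linkedServiceName": {
    "referenceName": "ls_azuresql_control",
    "type": "LinkedServiceReference"
  }
}
```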

Parameterization: one pipeline for multiple tables

Avoid creating one pipeline per table. With ForEach and parameters you ingest N tables with a single pipeline:

{
  "name": "tables_to_ingest",
  "type": "Array",
  "defaultValue": [
    { "schema": "dbo", "table": "orders",    "key": "order_id" },
    { "schema": "dbo", "table": "customers", "key": "customer_id" },
    { "schema": "dbo", "table": "products",  "key": "product_id" }
  ]
}

In the ForEach, access via @item().schema, @item().table, etc.
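
A sketch of the ForEach wrapper, invoking a child pipeline per table (the `pl_ingest_incremental` name is illustrative):

```json
{
  "name": "ForEachTable",
  "type": "ForEach",
  "typeProperties": {
    "items": { "value": "@pipeline().parameters.tables_to_ingest", "type": "Expression" },
    "isSequential": false,
    "batchCount": 5,
    "activities": [
      {
        "name": "IngestOneTable",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": { "referenceName": "pl_ingest_incremental", "type": "PipelineReference" },
          "parameters": {
            "schemaName": "@item().schema",
            "tableName":  "@item().table"
          }
        }
      }
    ]
  }
}
```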

Handling soft deletes

Source systems often flag deleted records instead of physically deleting them:

-- Capture soft deletes in the same incremental query:
-- do NOT filter out flagged rows — a soft delete also bumps
-- modified_at, so deleted records must flow through to Bronze
SELECT *
FROM dbo.orders
WHERE modified_at > '@{...}'
  AND modified_at <= '@{...}'

In the Silver layer, handle these logical deletes via a MERGE in Databricks.
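
A sketch of what that Silver MERGE can look like in Databricks SQL (table names and the `order_id` key are illustrative; `is_deleted` is the source flag captured above):

```sql
MERGE INTO silver.orders AS t
USING bronze.orders_increment AS s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.is_deleted = 1 THEN DELETE        -- propagate soft deletes
WHEN MATCHED THEN UPDATE SET *                       -- refresh changed rows
WHEN NOT MATCHED AND s.is_deleted = 0 THEN INSERT *  -- new, non-deleted rows
```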

Triggers and schedules

Tumbling Window Trigger

For incremental ingestion pipelines with fixed time windows, Tumbling Window is better than Schedule Trigger because:

  • It maintains execution state per window
  • It guarantees non-overlapping runs
  • It supports automatic retry per failed window

{
  "type": "TumblingWindowTrigger",
  "typeProperties": {
    "frequency": "Hour",
    "interval": 1,
    "startTime": "2026-01-01T00:00:00Z",
    "retryPolicy": { "count": 3, "intervalInSeconds": 30 }
  }
}
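
To make each run process exactly its window, pass the trigger's window bounds into the pipeline as parameters (a fragment of the trigger definition; the `pl_ingest_incremental` name and parameter names are illustrative):

```json
"pipelines": [
  {
    "pipelineReference": { "referenceName": "pl_ingest_incremental", "type": "PipelineReference" },
    "parameters": {
      "windowStart": "@trigger().outputs.windowStartTime",
      "windowEnd":   "@trigger().outputs.windowEndTime"
    }
  }
]
```

The pipeline can then filter on `windowStart`/`windowEnd` instead of (or as a cross-check against) the watermark lookups, which keeps reruns of a failed window deterministic.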

Monitoring

Configure Azure Monitor alerts for:

  • Pipeline failures
  • Duration above expected threshold (ingestion latency)
  • Rows copied below the expected minimum

Use ADF system variables in logs:

@pipeline().RunId
@pipeline().TriggerTime
@activity('CopyOrders').output.rowsCopied

Production lessons

  1. Always parameterize — never hardcode table or schema names
  2. Key Vault mandatory — no exceptions for credentials
  3. Watermark in SQL table, not pipeline parameter — avoids race conditions in parallel runs
  4. ForEach with batchCount — control parallelism; batchCount: 5 is a good starting point
  5. Tumbling Window over Schedule — for incremental pipelines with window SLA