ADF · 2026-04-01 · 4 min read

Azure Data Factory in practice: CDC ingestion pipelines from zero to prod

How to build robust ADF pipelines with a watermark strategy, parameterization, and error handling for ingestion at scale.

What is Azure Data Factory?

Azure Data Factory (ADF) is Microsoft's managed data integration service on Azure. It orchestrates data movement and transformation across 90+ native connectors — from SQL Server, Oracle and SAP to REST APIs, blobs and data lakes.

In a Lakehouse context, ADF is typically the entry point for data: it moves data from source systems into the Bronze layer of ADLS Gen2.

Core concepts

Linked Services

These are connections to external systems. Each source or destination needs a Linked Service with the correct credentials. Always use Azure Key Vault for secrets — never hardcode credentials in pipelines.

{
  "name": "ls_sqlserver_orders",
  "type": "AzureSqlDatabase",
  "typeProperties": {
    "connectionString": "@Microsoft.KeyVault(SecretUri=...)"
  }
}

Datasets

Datasets represent data structure — tables, files or containers. Prefer parameterized datasets to avoid duplication:

{
  "name": "ds_sqlserver_table",
  "type": "AzureSqlTable",
  "parameters": {
    "tableName": { "type": "string" },
    "schemaName": { "type": "string" }
  },
  "typeProperties": {
    "schema": "@dataset().schemaName",
    "table": "@dataset().tableName"
  }
}

Activities

Activities are the units of work inside a pipeline: Copy Data, Lookup, ForEach, If Condition, Stored Procedure, and more.

CDC ingestion with watermark

The watermark strategy is the most common approach for incremental ingestion from tables that do not have native CDC enabled.

1. Control table

Create a table in Azure SQL to store the last processed watermark per table:

CREATE TABLE dbo.watermark_control (
    table_name    NVARCHAR(100) PRIMARY KEY,
    last_modified DATETIME2,
    updated_at    DATETIME2 DEFAULT GETDATE()
);

INSERT INTO dbo.watermark_control (table_name, last_modified, updated_at)
VALUES ('orders', '2020-01-01', GETDATE());

2. Pipeline structure

The incremental ingestion pipeline follows this flow:

Lookup (last watermark)
    ↓
Lookup (current max timestamp from source)
    ↓
Copy Data (WHERE modified_at > last_wm AND modified_at <= current_wm)
    ↓
Stored Procedure (updates watermark_control)
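
The two Lookup activities can be backed by queries like these (a sketch; `modified_at` is assumed to be the source's change-tracking column):

```sql
-- Lookup 1 (GetLastWatermark): read the stored watermark from the control table
SELECT last_modified
FROM dbo.watermark_control
WHERE table_name = 'orders';

-- Lookup 2 (GetCurrentWatermark): capture the current high-water mark at the source.
-- Fixing the upper bound here, before the copy starts, avoids losing rows
-- written while the copy is running.
SELECT MAX(modified_at) AS max_modified
FROM dbo.orders;
```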

3. Parameterized Copy Data activity

-- Dynamic query in Copy Data activity
SELECT *
FROM [@{dataset().schemaName}].[@{dataset().tableName}]
WHERE modified_at > '@{activity('GetLastWatermark').output.firstRow.last_modified}'
  AND modified_at <= '@{activity('GetCurrentWatermark').output.firstRow.max_modified}'

4. Update watermark

After the copy, run a Stored Procedure to record the new watermark:

CREATE PROCEDURE dbo.usp_update_watermark
    @table_name    NVARCHAR(100),
    @new_watermark DATETIME2
AS
UPDATE dbo.watermark_control
SET last_modified = @new_watermark,
    updated_at    = GETDATE()
WHERE table_name = @table_name;
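
Wired into the pipeline, the final activity can pass the captured max straight into the procedure. A minimal sketch (the activity name and the `ls_azuresql_control` linked service are illustrative):

```json
{
  "name": "UpdateWatermark",
  "type": "SqlServerStoredProcedure",
  "typeProperties": {
    "storedProcedureName": "dbo.usp_update_watermark",
    "storedProcedureParameters": {
      "table_name":    { "value": "orders", "type": "String" },
      "new_watermark": {
        "value": "@activity('GetCurrentWatermark').output.firstRow.max_modified",
        "type": "DateTime"
      }
    }
  },
  "linkedServiceName": {
    "referenceName": "ls_azuresql_control",
    "type": "LinkedServiceReference"
  }
}
```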

Parameterization: one pipeline for multiple tables

Avoid creating one pipeline per table. With ForEach and parameters you ingest N tables with a single pipeline:

{
  "name": "tables_to_ingest",
  "type": "Array",
  "defaultValue": [
    { "schema": "dbo", "table": "orders",    "key": "order_id" },
    { "schema": "dbo", "table": "customers", "key": "customer_id" },
    { "schema": "dbo", "table": "products",  "key": "product_id" }
  ]
}

In the ForEach, access via @item().schema, @item().table, etc.
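
A sketch of the ForEach wrapper, invoking a child pipeline per table (the `pl_ingest_incremental` name is illustrative):

```json
{
  "name": "ForEachTable",
  "type": "ForEach",
  "typeProperties": {
    "items": { "value": "@pipeline().parameters.tables_to_ingest", "type": "Expression" },
    "isSequential": false,
    "batchCount": 5,
    "activities": [
      {
        "name": "IngestOneTable",
        "type": "ExecutePipeline",
        "typeProperties": {
          "pipeline": { "referenceName": "pl_ingest_incremental", "type": "PipelineReference" },
          "parameters": {
            "schemaName": "@item().schema",
            "tableName":  "@item().table"
          }
        }
      }
    ]
  }
}
```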

Handling soft deletes

Source systems often flag deleted records instead of physically deleting them:

-- Capture soft deletes in the same incremental query:
-- do NOT filter out flagged rows — a soft delete also bumps
-- modified_at, so deleted records must flow through to Bronze
SELECT *
FROM dbo.orders
WHERE modified_at > '@{...}'
  AND modified_at <= '@{...}'

In the Silver layer, handle these logical deletes via a MERGE in Databricks.
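
A sketch of what that Silver MERGE can look like in Databricks SQL (table names and the `order_id` key are illustrative; `is_deleted` is the source flag captured above):

```sql
MERGE INTO silver.orders AS t
USING bronze.orders_increment AS s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.is_deleted = 1 THEN DELETE        -- propagate soft deletes
WHEN MATCHED THEN UPDATE SET *                       -- refresh changed rows
WHEN NOT MATCHED AND s.is_deleted = 0 THEN INSERT *  -- new, non-deleted rows
```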

Triggers and schedules

Tumbling Window Trigger

For incremental ingestion pipelines with fixed time windows, Tumbling Window is better than Schedule Trigger because:

  • It maintains execution state per window
  • It guarantees non-overlapping runs
  • It supports automatic retry per failed window

{
  "type": "TumblingWindowTrigger",
  "typeProperties": {
    "frequency": "Hour",
    "interval": 1,
    "startTime": "2026-01-01T00:00:00Z",
    "retryPolicy": { "count": 3, "intervalInSeconds": 30 }
  }
}
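
To make each run process exactly its window, pass the trigger's window bounds into the pipeline as parameters (a fragment of the trigger definition; the `pl_ingest_incremental` name and parameter names are illustrative):

```json
"pipelines": [
  {
    "pipelineReference": { "referenceName": "pl_ingest_incremental", "type": "PipelineReference" },
    "parameters": {
      "windowStart": "@trigger().outputs.windowStartTime",
      "windowEnd":   "@trigger().outputs.windowEndTime"
    }
  }
]
```

The pipeline can then filter on `windowStart`/`windowEnd` instead of (or as a cross-check against) the watermark lookups, which keeps reruns of a failed window deterministic.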

Monitoring

Configure Azure Monitor alerts for:

  • Pipeline failures
  • Duration above expected threshold (ingestion latency)
  • Rows copied below the expected minimum

Use ADF system variables in logs:

@pipeline().RunId
@pipeline().TriggerTime
@activity('CopyOrders').output.rowsCopied

Production lessons

  1. Always parameterize — never hardcode table or schema names
  2. Key Vault mandatory — no exceptions for credentials
  3. Watermark in SQL table, not pipeline parameter — avoids race conditions in parallel runs
  4. ForEach with batchCount — control parallelism; batchCount: 5 is a good starting point
  5. Tumbling Window over Schedule — for incremental pipelines with window SLA