What is Azure Data Factory?
Azure Data Factory (ADF) is Microsoft's managed data integration service on Azure. It orchestrates data movement and transformation across 90+ native connectors — from SQL Server, Oracle and SAP to REST APIs, blobs and data lakes.
In a Lakehouse context, ADF is typically the entry point for data: it moves data from source systems into the Bronze layer of ADLS Gen2.
Core concepts
Linked Services
These are connections to external systems. Each source or destination needs a Linked Service with the correct credentials. Always use Azure Key Vault for secrets — never hardcode credentials in pipelines.
{
  "name": "ls_sqlserver_orders",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": {
        "type": "AzureKeyVaultSecret",
        "store": {
          "referenceName": "ls_keyvault",
          "type": "LinkedServiceReference"
        },
        "secretName": "sqlserver-orders-connstring"
      }
    }
  }
}
Datasets
Datasets represent data structure — tables, files or containers. Prefer parameterized datasets to avoid duplication:
{
"name": "ds_sqlserver_table",
"parameters": {
"tableName": { "type": "string" },
"schemaName": { "type": "string" }
}
}
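At runtime, the calling activity supplies values for these parameters through a dataset reference. A minimal sketch (the surrounding Copy activity and the concrete values are illustrative):

```json
"inputs": [
  {
    "referenceName": "ds_sqlserver_table",
    "type": "DatasetReference",
    "parameters": {
      "schemaName": "dbo",
      "tableName": "orders"
    }
  }
]
```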
Activities
Activities are the units of work inside a pipeline: Copy Data, Lookup, ForEach, If Condition, Stored Procedure, and more.
CDC ingestion with watermark
The watermark (high-water-mark) strategy is the most common approach for incremental ingestion from tables that do not have native CDC enabled.
1. Control table
Create a table in Azure SQL to store the last processed watermark per table:
CREATE TABLE dbo.watermark_control (
table_name NVARCHAR(100) PRIMARY KEY,
last_modified DATETIME2,
updated_at DATETIME2 DEFAULT GETDATE()
);
INSERT INTO dbo.watermark_control VALUES ('orders', '2020-01-01', GETDATE());
2. Pipeline structure
The incremental ingestion pipeline follows this flow:
Lookup (last watermark)
↓
Lookup (current max timestamp from source)
↓
Copy Data (WHERE modified_at > last_wm AND modified_at <= current_wm)
↓
Stored Procedure (updates watermark_control)
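The two Lookup activities can run queries like the following (table and column names follow the control-table example above):

```sql
-- Lookup 1: last processed watermark from the control table
SELECT last_modified
FROM dbo.watermark_control
WHERE table_name = 'orders';

-- Lookup 2: current high watermark from the source table
SELECT MAX(modified_at) AS max_modified
FROM dbo.orders;
```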
3. Parameterized Copy Data activity
-- Dynamic query in Copy Data activity
SELECT *
FROM @{dataset().schemaName}.@{dataset().tableName}
WHERE modified_at > '@{activity('GetLastWatermark').output.firstRow.last_modified}'
AND modified_at <= '@{activity('GetCurrentWatermark').output.firstRow.max_modified}'
4. Update watermark
After the copy, run a Stored Procedure to record the new watermark:
CREATE PROCEDURE dbo.usp_update_watermark
@table_name NVARCHAR(100),
@new_watermark DATETIME2
AS
UPDATE dbo.watermark_control
SET last_modified = @new_watermark,
updated_at = GETDATE()
WHERE table_name = @table_name;
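In the pipeline, the Stored Procedure activity passes the new watermark via storedProcedureParameters. A sketch, assuming the Lookup activity name used elsewhere in this article:

```json
"typeProperties": {
  "storedProcedureName": "dbo.usp_update_watermark",
  "storedProcedureParameters": {
    "table_name": { "value": "orders", "type": "String" },
    "new_watermark": {
      "value": "@activity('GetCurrentWatermark').output.firstRow.max_modified",
      "type": "DateTime"
    }
  }
}
```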
Parameterization: one pipeline for multiple tables
Avoid creating one pipeline per table. With ForEach and parameters you ingest N tables with a single pipeline:
{
"name": "tables_to_ingest",
"type": "Array",
"defaultValue": [
{ "schema": "dbo", "table": "orders", "key": "order_id" },
{ "schema": "dbo", "table": "customers", "key": "customer_id" },
{ "schema": "dbo", "table": "products", "key": "product_id" }
]
}
In the ForEach, access via @item().schema, @item().table, etc.
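A sketch of the ForEach wrapper itself (the inner Copy activity is omitted here; batchCount caps how many iterations run in parallel):

```json
{
  "name": "ForEachTable",
  "type": "ForEach",
  "typeProperties": {
    "isSequential": false,
    "batchCount": 5,
    "items": {
      "value": "@pipeline().parameters.tables_to_ingest",
      "type": "Expression"
    },
    "activities": []
  }
}
```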
Handling soft deletes
Source systems often flag deleted records instead of physically deleting them:
-- Capture soft deletes in the same incremental query:
-- the is_deleted flag travels with the row into Bronze
SELECT *
FROM dbo.orders
WHERE modified_at > '@{...}'
  AND modified_at <= '@{...}'
In the Silver layer, apply the logical delete logic via MERGE in Databricks.
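A hedged sketch of that MERGE in Databricks SQL (Delta Lake syntax; the silver.orders and bronze.orders_increment table names and the key column are assumed for illustration):

```sql
MERGE INTO silver.orders AS t
USING bronze.orders_increment AS s
  ON t.order_id = s.order_id
WHEN MATCHED AND s.is_deleted = 1 THEN
  DELETE  -- or update a deleted flag instead, if history must be kept
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED AND s.is_deleted = 0 THEN
  INSERT *
```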
Triggers and schedules
Tumbling Window Trigger
For incremental ingestion pipelines with fixed time windows, Tumbling Window is better than Schedule Trigger because:
- It maintains execution state per window
- It guarantees non-overlapping runs
- It supports automatic retry per failed window
{
"type": "TumblingWindowTrigger",
"typeProperties": {
"frequency": "Hour",
"interval": 1,
"startTime": "2026-01-01T00:00:00Z",
"retryPolicy": { "count": 3, "intervalInSeconds": 30 }
}
}
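The window boundaries can be passed into the pipeline, so the Copy query can filter on the exact window. A sketch of the trigger's pipeline parameter mapping (the pipeline name is a placeholder; windowStartTime and windowEndTime are system variables exposed by the trigger):

```json
"pipeline": {
  "pipelineReference": {
    "referenceName": "pl_incremental_ingest",
    "type": "PipelineReference"
  },
  "parameters": {
    "windowStart": "@trigger().outputs.windowStartTime",
    "windowEnd": "@trigger().outputs.windowEndTime"
  }
}
```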
Monitoring
Configure Azure Monitor alerts for:
- Pipeline failures
- Duration above expected threshold (ingestion latency)
- Rows copied below the expected minimum
Use ADF system variables in logs:
@pipeline().RunId
@pipeline().TriggerTime
@activity('CopyOrders').output.rowsCopied
Production lessons
- Always parameterize — never hardcode table or schema names
- Key Vault mandatory — no exceptions for credentials
- Watermark in SQL table, not pipeline parameter — avoids race conditions in parallel runs
- ForEach with batchCount — control parallelism; batchCount: 5 is a good starting point
- Tumbling Window over Schedule — for incremental pipelines with a window SLA