What is ADLS Gen2?
Azure Data Lake Storage Gen2 (ADLS Gen2) is Microsoft's object storage optimized for analytical workloads. It combines Azure Blob Storage with the Hierarchical Namespace (HNS), which enables efficient directory operations that Spark performance depends on.
Without HNS, renaming a folder with 10 million files is an O(n) operation — it copies and deletes each file. With HNS, it's O(1): an atomic metadata operation.
Recommended container structure
storage_account: datalakeprod
├── bronze/ ← raw data, immutable
├── silver/ ← cleansed and standardized data
├── gold/ ← consumption-ready data
├── landing/ ← temporary landing zone (before Bronze)
├── checkpoints/ ← Spark Streaming checkpoints
└── logs/ ← pipeline execution logs
Using separate containers per layer, rather than folders inside a single container, simplifies permission management with RBAC/ACLs and lifecycle policies for retention and tiering.
Folder hierarchy inside each container
Bronze
bronze/
└── {source_system}/
    └── {entity}/
        └── {year}/{month}/{day}/
            └── {timestamp}_{batch_id}.parquet
Real example:
bronze/
└── sqlserver_orders/
    ├── orders/
    │   └── 2026/04/01/
    │       ├── 20260401_080000_batch001.parquet
    │       └── 20260401_090000_batch002.parquet
    └── customers/
        └── 2026/04/01/
Partitioning by date in Bronze enables targeted reprocessing, lifecycle policies, and data arrival auditing.
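The naming convention above can be captured in a small helper. The function and argument names here are illustrative, not part of any Azure or Spark API:

```python
from datetime import datetime

def bronze_path(source_system: str, entity: str,
                ingested_at: datetime, batch_id: str) -> str:
    """Build a Bronze-layer relative path following the
    {source_system}/{entity}/{year}/{month}/{day}/{timestamp}_{batch_id}.parquet
    convention described above."""
    return (
        f"{source_system}/{entity}/"
        f"{ingested_at:%Y}/{ingested_at:%m}/{ingested_at:%d}/"
        f"{ingested_at:%Y%m%d_%H%M%S}_{batch_id}.parquet"
    )

# bronze_path("sqlserver_orders", "orders", datetime(2026, 4, 1, 8, 0, 0), "batch001")
# → "sqlserver_orders/orders/2026/04/01/20260401_080000_batch001.parquet"
```

Centralizing the convention in one function keeps ingestion jobs from drifting into slightly different layouts.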
Silver
silver/
└── {domain}/
    └── {entity}/
        └── (Delta table — managed by the Delta log)
In Silver, let Delta manage the layout: don't create manual date subfolders, since Delta handles partitioning internally through partition columns and the transaction log.
Gold
gold/
└── {subject_area}/
    └── {mart_name}/
Connecting in Spark
Via ABFS (recommended)
# Azure Blob FileSystem — native protocol for ADLS Gen2
path = "abfss://silver@datalakeprod.dfs.core.windows.net/orders/"
df = spark.read.format("delta").load(path)
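The abfss:// URI always has the same shape (container, account, path), so it's worth composing it in one place. The helper name below is ours, not a Spark or Azure API:

```python
def abfss_uri(container: str, account: str, path: str = "") -> str:
    """Compose an ABFS URI: abfss://<container>@<account>.dfs.core.windows.net/<path>"""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"

# abfss_uri("silver", "datalakeprod", "orders/")
# → "abfss://silver@datalakeprod.dfs.core.windows.net/orders/"
```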
Authentication with Service Principal
# Configure on the cluster or in the notebook
spark.conf.set(
    "fs.azure.account.auth.type.datalakeprod.dfs.core.windows.net",
    "OAuth"
)
spark.conf.set(
    "fs.azure.account.oauth.provider.type.datalakeprod.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
)
spark.conf.set(
    "fs.azure.account.oauth2.client.id.datalakeprod.dfs.core.windows.net",
    dbutils.secrets.get("kv-prod", "sp-client-id")
)
spark.conf.set(
    "fs.azure.account.oauth2.client.secret.datalakeprod.dfs.core.windows.net",
    dbutils.secrets.get("kv-prod", "sp-client-secret")
)
# tenant_id (your Microsoft Entra tenant ID) must be defined beforehand
spark.conf.set(
    "fs.azure.account.oauth2.client.endpoint.datalakeprod.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
)
With Unity Catalog, use External Locations — don't configure Spark authentication manually.
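With Unity Catalog, the equivalent setup is declarative. A sketch in Databricks SQL, where the location name, credential name, and group are placeholders:

```sql
-- Location and credential names are illustrative
CREATE EXTERNAL LOCATION IF NOT EXISTS silver_location
URL 'abfss://silver@datalakeprod.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL prod_storage_credential);

GRANT READ FILES ON EXTERNAL LOCATION silver_location TO `analytics-team`;
```

Access then flows through Unity Catalog grants instead of per-cluster Spark configuration.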
Access control: RBAC vs ACLs
RBAC (Role-Based Access Control)
Assigned at subscription, resource group, or storage account level. Too coarse for fine-grained Data Lake access.
| Role | Recommended use |
|-------------------------------|--------------------------------------|
| Storage Blob Data Owner | Storage account admin |
| Storage Blob Data Contributor | Ingestion services (ADF, pipelines) |
| Storage Blob Data Reader | Read-only consumers |
ACLs (Access Control Lists)
Granularity at container, folder, or file level. Ideal for fine-grained control per layer:
# Using Azure CLI
# Grant read+execute on the silver/orders folder to the analytics group.
# Note: "az storage fs access set --acl" replaces the folder's entire ACL,
# so the spec must include the base user::, group::, and other:: entries.
# Groups are referenced by their Microsoft Entra object ID (a GUID),
# shown here as a placeholder.
az storage fs access set \
  --acl "user::rwx,group::r-x,other::---,group:<analytics-team-object-id>:r-x" \
  --path "orders" \
  --file-system "silver" \
  --account-name "datalakeprod"
# Default ACL (applies to new files created inside the folder)
az storage fs access set \
  --acl "default:user::rwx,default:group::r-x,default:other::---,default:group:<analytics-team-object-id>:r-x" \
  --path "orders" \
  --file-system "silver" \
  --account-name "datalakeprod"
A practical rule: use RBAC for service access (ADF = Contributor on Bronze) and ACLs for team or user access (analysts = Reader on Gold).
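Following that rule, the service-side RBAC grant is a single role assignment. The assignee object ID and resource IDs below are placeholders:

```shell
# Grant the ADF managed identity write access at the storage-account scope
# (object ID, subscription ID, and resource group are placeholders)
az role assignment create \
  --assignee "<adf-managed-identity-object-id>" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/datalakeprod"
```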
Lifecycle policies
Configure automatic tiering policies to reduce cost:
{
"rules": [
{
"name": "bronze-tiering",
"type": "Lifecycle",
"definition": {
"filters": { "blobTypes": ["blockBlob"], "prefixMatch": ["bronze/"] },
"actions": {
"baseBlob": {
"tierToCool": { "daysAfterModificationGreaterThan": 30 },
"tierToArchive": { "daysAfterModificationGreaterThan": 90 }
}
}
}
}
]
}
Bronze data older than 30 days moves to Cool (roughly 40% cheaper per GB); after 90 days it moves to Archive (roughly 80% cheaper, but reads require rehydration first). Silver and Gold data typically stays in Hot due to frequent access.
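The policy's effect can be sketched as a pure function over blob age. The thresholds mirror the JSON rule above; the function itself is illustrative, not an Azure API:

```python
def target_tier(days_since_modification: int) -> str:
    """Mirror the bronze-tiering rule: Hot through day 30,
    Cool after day 30, Archive after day 90 (strictly greater-than,
    matching daysAfterModificationGreaterThan)."""
    if days_since_modification > 90:
        return "Archive"
    if days_since_modification > 30:
        return "Cool"
    return "Hot"

# target_tier(10) → "Hot"; target_tier(45) → "Cool"; target_tier(120) → "Archive"
```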
Performance tips
- Avoid many small files — use OPTIMIZE in Delta to consolidate
- Partition by frequent filter columns, but avoid high cardinality (don't partition by order_id)
- Use ABFS (abfss://), never WASB (wasbs://), which is legacy and slower
- Prefer Managed Identity over Service Principal when possible — fewer secrets to manage
- HNS is mandatory — create the storage account with Hierarchical Namespace from day one; migrating later is painful