What is ADLS Gen2?
Azure Data Lake Storage Gen2 (ADLS Gen2) is Microsoft's object storage optimized for analytical workloads. It combines Azure Blob Storage with the Hierarchical Namespace (HNS), which enables efficient directory operations that Spark performance depends on.
Without HNS, renaming a folder with 10 million files is an O(n) operation — it copies and deletes each file. With HNS, it's O(1): an atomic metadata operation.
Recommended container structure
storage_account: datalakeprod
├── bronze/ ← raw data, immutable
├── silver/ ← cleansed and standardized data
├── gold/ ← consumption-ready data
├── landing/ ← temporary landing zone (before Bronze)
├── checkpoints/ ← Spark Streaming checkpoints
└── logs/ ← pipeline execution logs
Using separate containers per layer, rather than folders inside a single container, simplifies permission management with RBAC/ACLs and lifecycle policies for retention and tiering.
Folder hierarchy inside each container
Bronze
bronze/
└── {source_system}/
    └── {entity}/
        └── {year}/{month}/{day}/
            └── {timestamp}_{batch_id}.parquet
Real example:
bronze/
└── sqlserver_orders/
    ├── orders/
    │   └── 2026/04/01/
    │       ├── 20260401_080000_batch001.parquet
    │       └── 20260401_090000_batch002.parquet
    └── customers/
        └── 2026/04/01/
Partitioning by date in Bronze enables targeted reprocessing, lifecycle policies, and data arrival auditing.
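The naming convention above can be captured in a small helper. The function and argument names here are illustrative, not part of any Azure or Spark API:

```python
from datetime import datetime

def bronze_path(source_system: str, entity: str,
                ingested_at: datetime, batch_id: str) -> str:
    """Build a Bronze-layer relative path following the
    {source_system}/{entity}/{year}/{month}/{day}/{timestamp}_{batch_id}.parquet
    convention described above."""
    return (
        f"{source_system}/{entity}/"
        f"{ingested_at:%Y}/{ingested_at:%m}/{ingested_at:%d}/"
        f"{ingested_at:%Y%m%d_%H%M%S}_{batch_id}.parquet"
    )

# bronze_path("sqlserver_orders", "orders", datetime(2026, 4, 1, 8, 0, 0), "batch001")
# → "sqlserver_orders/orders/2026/04/01/20260401_080000_batch001.parquet"
```

Centralizing the convention in one function keeps ingestion jobs from drifting into slightly different layouts.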
Silver
silver/
└── {domain}/
    └── {entity}/
        └── (Delta table — managed by the Delta log)
In Silver, let Delta manage the layout: don't create manual date subfolders, since Delta handles partitioning internally through partition columns and the transaction log.
Gold
gold/
└── {subject_area}/
    └── {mart_name}/
Connecting in Spark
Via ABFS (recommended)
# Azure Blob FileSystem — native protocol for ADLS Gen2
path = "abfss://silver@datalakeprod.dfs.core.windows.net/orders/"
df = spark.read.format("delta").load(path)
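The abfss:// URI always has the same shape (container, account, path), so it's worth composing it in one place. The helper name below is ours, not a Spark or Azure API:

```python
def abfss_uri(container: str, account: str, path: str = "") -> str:
    """Compose an ABFS URI: abfss://<container>@<account>.dfs.core.windows.net/<path>"""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"

# abfss_uri("silver", "datalakeprod", "orders/")
# → "abfss://silver@datalakeprod.dfs.core.windows.net/orders/"
```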
Authentication with Service Principal
# Configure on the cluster or in the notebook
spark.conf.set(
    "fs.azure.account.auth.type.datalakeprod.dfs.core.windows.net",
    "OAuth"
)
spark.conf.set(
    "fs.azure.account.oauth.provider.type.datalakeprod.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider"
)
spark.conf.set(
    "fs.azure.account.oauth2.client.id.datalakeprod.dfs.core.windows.net",
    dbutils.secrets.get("kv-prod", "sp-client-id")
)
spark.conf.set(
    "fs.azure.account.oauth2.client.secret.datalakeprod.dfs.core.windows.net",
    dbutils.secrets.get("kv-prod", "sp-client-secret")
)
# tenant_id (your Microsoft Entra tenant ID) must be defined beforehand
spark.conf.set(
    "fs.azure.account.oauth2.client.endpoint.datalakeprod.dfs.core.windows.net",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
)
With Unity Catalog, use External Locations — don't configure Spark authentication manually.
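With Unity Catalog, the equivalent setup is declarative. A sketch in Databricks SQL, where the location name, credential name, and group are placeholders:

```sql
-- Location and credential names are illustrative
CREATE EXTERNAL LOCATION IF NOT EXISTS silver_location
URL 'abfss://silver@datalakeprod.dfs.core.windows.net/'
WITH (STORAGE CREDENTIAL prod_storage_credential);

GRANT READ FILES ON EXTERNAL LOCATION silver_location TO `analytics-team`;
```

Access then flows through Unity Catalog grants instead of per-cluster Spark configuration.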
Access control: RBAC vs ACLs
RBAC (Role-Based Access Control)
Assigned at subscription, resource group, or storage account level. Too coarse for fine-grained Data Lake access.
| Role | Recommended use |
|-------------------------------|--------------------------------------|
| Storage Blob Data Owner | Storage account admin |
| Storage Blob Data Contributor | Ingestion services (ADF, pipelines) |
| Storage Blob Data Reader | Read-only consumers |
ACLs (Access Control Lists)
Granularity at container, folder, or file level. Ideal for fine-grained control per layer:
# Using Azure CLI
# Grant read+execute on the silver/orders folder to the analytics group.
# Note: "az storage fs access set --acl" replaces the folder's entire ACL,
# so the spec must include the base user::, group::, and other:: entries.
# Groups are referenced by their Microsoft Entra object ID (a GUID),
# shown here as a placeholder.
az storage fs access set \
  --acl "user::rwx,group::r-x,other::---,group:<analytics-team-object-id>:r-x" \
  --path "orders" \
  --file-system "silver" \
  --account-name "datalakeprod"
# Default ACL (applies to new files created inside the folder)
az storage fs access set \
  --acl "default:user::rwx,default:group::r-x,default:other::---,default:group:<analytics-team-object-id>:r-x" \
  --path "orders" \
  --file-system "silver" \
  --account-name "datalakeprod"
A practical rule: use RBAC for service access (ADF = Contributor on Bronze) and ACLs for team or user access (analysts = Reader on Gold).
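Following that rule, the service-side RBAC grant is a single role assignment. The assignee object ID and resource IDs below are placeholders:

```shell
# Grant the ADF managed identity write access at the storage-account scope
# (object ID, subscription ID, and resource group are placeholders)
az role assignment create \
  --assignee "<adf-managed-identity-object-id>" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/datalakeprod"
```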
Lifecycle policies
Configure automatic tiering policies to reduce cost:
{
"rules": [
{
"name": "bronze-tiering",
"type": "Lifecycle",
"definition": {
"filters": { "blobTypes": ["blockBlob"], "prefixMatch": ["bronze/"] },
"actions": {
"baseBlob": {
"tierToCool": { "daysAfterModificationGreaterThan": 30 },
"tierToArchive": { "daysAfterModificationGreaterThan": 90 }
}
}
}
}
]
}
Bronze data older than 30 days moves to Cool (roughly 40% cheaper per GB); after 90 days it moves to Archive (roughly 80% cheaper, but reads require rehydration first). Silver and Gold data typically stays in Hot due to frequent access.
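The policy's effect can be sketched as a pure function over blob age. The thresholds mirror the JSON rule above; the function itself is illustrative, not an Azure API:

```python
def target_tier(days_since_modification: int) -> str:
    """Mirror the bronze-tiering rule: Hot through day 30,
    Cool after day 30, Archive after day 90 (strictly greater-than,
    matching daysAfterModificationGreaterThan)."""
    if days_since_modification > 90:
        return "Archive"
    if days_since_modification > 30:
        return "Cool"
    return "Hot"

# target_tier(10) → "Hot"; target_tier(45) → "Cool"; target_tier(120) → "Archive"
```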
Performance tips
- Avoid many small files — use OPTIMIZE in Delta to consolidate
- Partition by frequent filter columns, but avoid high cardinality (don't partition by order_id)
- Use ABFS (abfss://), never WASB (wasbs://), which is legacy and slower
- Prefer Managed Identity over Service Principal when possible — fewer secrets to manage
- HNS is mandatory — create the storage account with Hierarchical Namespace from day one; migrating later is painful