Tools, frameworks, and experiments built while solving real data engineering problems in production.
Reusable PySpark library for incremental MERGE operations with UPSERT and SCD Type 2 support. Battle-tested on 50M+ row Delta tables in Databricks.
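The SCD Type 2 semantics can be sketched in plain Python, independent of Spark. This is an illustrative model of the merge logic (close the old version, open a new one), not the library's actual API; the real implementation issues a Delta `MERGE` against the table, and all names here are hypothetical:

```python
def scd2_merge(current, incoming, key, effective_date):
    """Sketch of SCD Type 2 merge semantics over lists of dicts.

    Rows changed in `incoming` get their open version expired
    (valid_to set, is_current flipped) and a new version appended;
    unchanged and untouched keys pass through as-is.
    """
    active = {r[key]: r for r in current if r["is_current"]}
    result = [r for r in current if not r["is_current"]]  # keep history
    for row in incoming:
        old = active.pop(row[key], None)
        if old is not None:
            changed = any(old.get(col) != val for col, val in row.items())
            if not changed:
                result.append(old)  # unchanged: keep the open version
                continue
            # expire the previous version
            result.append({**old, "valid_to": effective_date, "is_current": False})
        # open a new current version
        result.append({**row, "valid_from": effective_date,
                       "valid_to": None, "is_current": True})
    # keys absent from this batch stay current, untouched
    result.extend(active.values())
    return result
```

The same close-then-insert pattern maps onto a Delta `MERGE` with a matched-and-changed update branch plus an insert branch for new versions.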
Parameterized Azure Data Factory pipeline templates for watermark-based CDC ingestion. Handles soft deletes, schema drift, and retry logic out of the box.
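The core of a watermark pull is small enough to show inline. A minimal sketch of one CDC iteration, assuming an in-memory change feed with a `modified_at` column (the field and parameter names are illustrative, not the template's actual ones):

```python
def incremental_pull(rows, last_watermark):
    """One watermark-based CDC iteration: take rows modified after the
    stored watermark, then advance the watermark to the max value seen.

    Soft deletes travel as ordinary rows (e.g. with an is_deleted flag)
    so the sink can close them out instead of silently losing them.
    """
    batch = [r for r in rows if r["modified_at"] > last_watermark]
    new_watermark = max((r["modified_at"] for r in batch), default=last_watermark)
    return batch, new_watermark
```

In the ADF template this predicate becomes a parameterized source query, and the new watermark is written back to a control table only after the copy succeeds, which is what makes retries safe.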
Terraform + Python scripts to provision a Unity Catalog metastore, catalogs, schemas, and permissions from a declarative YAML config. Reduces governance setup from days to hours.
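A declarative config of this kind might look like the fragment below. This is a hypothetical shape to convey the idea; the key names and privilege strings are illustrative, not the tool's actual schema:

```yaml
# Illustrative config shape: catalogs -> schemas -> grants
catalogs:
  - name: sales
    schemas:
      - name: bronze
        grants:
          - principal: data-engineers
            privileges: [USE_SCHEMA, SELECT, MODIFY]
      - name: silver
        grants:
          - principal: analysts
            privileges: [USE_SCHEMA, SELECT]
```

The Python side expands a file like this into Terraform resources, so adding a schema or a grant is a one-line diff reviewed like any other code change.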
CLI tool that validates incoming datasets against a YAML-defined data contract before Bronze ingestion. Catches schema drift and null violations at the source.
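The two checks the blurb names, schema drift and null violations, reduce to a short loop. A minimal sketch assuming the contract has been parsed from YAML into a dict; the field names (`columns`, `nullable`) are illustrative, not the CLI's actual contract format:

```python
def validate_batch(contract, rows):
    """Validate rows against a parsed data contract.

    Reports schema drift (columns added or missing relative to the
    contract) and nulls in columns the contract marks non-nullable.
    """
    expected = {c["name"]: c for c in contract["columns"]}
    errors = []
    for i, row in enumerate(rows):
        drift = set(row) ^ set(expected)  # symmetric diff: added or missing
        if drift:
            errors.append(f"row {i}: schema drift on {sorted(drift)}")
            continue
        for name, spec in expected.items():
            if row[name] is None and not spec.get("nullable", False):
                errors.append(f"row {i}: null in non-nullable '{name}'")
    return errors
```

Running this before Bronze ingestion means a producer-side change fails loudly at the boundary instead of surfacing as broken Silver tables days later.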
Streamlit dashboard connected to Databricks system tables and ADF run history. Shows pipeline SLA, row volume trends, and data freshness per table.
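The freshness metric behind a dashboard like this is straightforward to sketch. A minimal illustration assuming per-table last-write timestamps have already been pulled from system tables; the input shapes and the default 24-hour SLA are assumptions, not the dashboard's actual schema:

```python
from datetime import datetime

def freshness_report(last_updates, now, sla_hours):
    """Compute hours since each table's last successful write and flag
    tables whose age exceeds their SLA (default 24h if unspecified)."""
    report = {}
    for table, last in last_updates.items():
        age_h = (now - last).total_seconds() / 3600
        report[table] = {
            "age_hours": round(age_h, 1),
            "breach": age_h > sla_hours.get(table, 24),
        }
    return report
```

In the dashboard this feeds a simple red/green table per dataset, with row-volume trends charted alongside from the ADF run history.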