Career · 2026-03-14 · 7 min read

From Data Analyst to Senior Data Engineer: what I learned and what I'd do differently

The skills, mindset shifts, and mistakes that shaped my path from writing SQL reports to leading a Lakehouse team.

I started as a data analyst. SQL queries, Excel pivots, Power BI dashboards. I was good at it. Then I decided I wanted to build the infrastructure that feeds those dashboards, not just consume it. This article is an honest account of that transition — what I expected, what actually mattered, and what I got completely wrong.

Where I started

My first "engineering" work was writing Python scripts to automate report exports. Scheduled with Windows Task Scheduler, output to a shared network folder. It worked. I called it a pipeline. It wasn't.

The jump to actual data engineering happened when I joined a team that was building a Lakehouse from scratch on Azure. I was the most junior person on the team. I knew Python and SQL. I had no idea what Delta Lake was, what CDC meant in practice, or why anyone would use Spark when pandas worked fine on my laptop.

That gap between "I know Python and SQL" and "I can build and operate production data pipelines" is much larger than it looks from the outside. Understanding it is step one.

The first real shift: pipelines are software

The most important mindset change is realizing that data pipelines are software products. They need to be:

  • Idempotent: running the same pipeline twice produces the same result
  • Observable: failures are detectable, diagnosable, and alertable
  • Testable: you can verify behavior without running against production data
  • Versioned: changes are tracked, rollbacks are possible

When I was an analyst, I wrote scripts. Scripts have none of these properties. A script that runs once and produces output is a success. A pipeline that runs daily for three years, handles schema changes, recovers from failures, and never silently produces wrong data — that's engineering.

The transition required changing how I thought about every piece of code I wrote. Not "does this produce the right output today?" but "what happens when this breaks at 3 AM six months from now?"

Skills that actually mattered

Distributed systems intuition. You don't need to understand Spark's internal execution engine at a PhD level, but you need to feel when something is going to be a bad idea at scale. Joins on non-partitioned tables. Collecting large dataframes to the driver. Nested loops over Spark DataFrames. These feel natural when you think in pandas and they're catastrophic in PySpark. The mental model shift from single-machine to distributed is foundational.
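To make the mental-model shift concrete, here is a minimal pure-Python sketch (the data and names are hypothetical) contrasting the row-by-row instinct with the set-based shape that translates well to a distributed engine:

```python
# Hypothetical toy data standing in for two large tables.
orders = [{"order_id": i, "customer_id": i % 3} for i in range(6)]
customers = [{"customer_id": c, "name": f"c{c}"} for c in range(3)]

# Single-machine instinct: a nested loop over rows. Harmless in pandas
# on a laptop; the same shape (row-by-row driver loops, collecting
# everything into one process) is catastrophic on a cluster.
joined_loop = [
    {**o, "name": c["name"]}
    for o in orders
    for c in customers
    if o["customer_id"] == c["customer_id"]
]

# Set-based thinking: build a small lookup once and express the join
# as one bulk operation -- the shape a broadcast join takes in Spark.
lookup = {c["customer_id"]: c["name"] for c in customers}
joined_bulk = [{**o, "name": lookup[o["customer_id"]]} for o in orders]

assert joined_loop == joined_bulk  # same result, very different scaling
```

The point is not the toy join itself but the habit: express work as bulk operations over partitioned data, not as iteration over individual rows.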

Idempotency as a habit. Every pipeline should be safely re-runnable. If it fails halfway through and you rerun it, the output should be identical to a clean run. This sounds obvious; implementing it consistently is harder. Use overwrite mode instead of append for full loads, merge-based upserts for incremental loads, and watermarks that tolerate being reset.
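The merge-based upsert semantics can be sketched without any engine at all. This is a pure-Python stand-in (the field names are hypothetical) for what a Delta Lake MERGE does: matching keys update, new keys insert, and replaying the same batch changes nothing.

```python
def upsert(target: dict, incoming: list[dict], key: str) -> dict:
    """Merge incoming rows into target, keyed by `key`.

    Mimics MERGE-based upsert semantics: existing keys are updated,
    new keys are inserted. Replaying the same batch is a no-op --
    the idempotency property.
    """
    for row in incoming:
        target[row[key]] = row  # update if key exists, insert otherwise
    return target

batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]

once = upsert({}, batch, key="id")
twice = upsert(upsert({}, batch, key="id"), batch, key="id")
assert once == twice  # rerunning after a half-failed attempt is safe
```

A blind append would duplicate both rows on the rerun; keying the write is what makes the retry safe.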

Observability before features. Early in my engineering career I focused on pipeline functionality. Get the data from A to B, transform it correctly, done. Observability was an afterthought — I'd add logging "later". Later never comes. Now I instrument first: row counts logged at each stage, watermarks recorded, latency tracked, alerting configured before the first production run. An unmonitored pipeline is a ticking time bomb.
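"Instrument first" can be as simple as a decorator that every stage passes through. A minimal sketch, assuming in-memory lists of rows (the stage name and transformation are made up for illustration):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def instrumented(stage: str):
    """Decorator: log row counts and latency for a pipeline stage."""
    def wrap(fn):
        def inner(rows):
            start = time.monotonic()
            out = fn(rows)
            log.info(
                "stage=%s rows_in=%d rows_out=%d seconds=%.3f",
                stage, len(rows), len(out), time.monotonic() - start,
            )
            return out
        return inner
    return wrap

@instrumented("silver_drop_nulls")
def drop_nulls(rows):
    return [r for r in rows if r.get("amount") is not None]

clean = drop_nulls([{"amount": 1}, {"amount": None}, {"amount": 3}])
```

In production the log line would go to a metrics sink and feed an alert on rows_out dropping to zero, but the habit is the same: counts and latency are captured before the first feature ships.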

Understanding storage. Knowing how Parquet files work, why small files are expensive, what Z-ordering does, when to partition vs. when not to — this knowledge pays compound interest. Every performance problem I've debugged ultimately came back to how data was physically organized on disk.
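Partition pruning, the main payoff of getting physical layout right, is easy to sketch. Here is a toy model (paths and dates are hypothetical) of why a filter on the partition column means most files are never opened:

```python
# Hypothetical file listing for a table partitioned by event_date.
files = [
    {"path": "events/event_date=2026-03-01/part-0.parquet", "date": "2026-03-01"},
    {"path": "events/event_date=2026-03-02/part-0.parquet", "date": "2026-03-02"},
    {"path": "events/event_date=2026-03-03/part-0.parquet", "date": "2026-03-03"},
]

def prune(files, wanted_date):
    """Partition pruning: a predicate on the partition column is
    resolved against file metadata alone, so files for other
    partitions are never read."""
    return [f["path"] for f in files if f["date"] == wanted_date]

to_read = prune(files, "2026-03-02")
assert to_read == ["events/event_date=2026-03-02/part-0.parquet"]
```

The flip side is the small-files problem: partition too finely and the metadata listing itself becomes the bottleneck, which is why over-partitioning can cost more than it saves.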

Skills I overestimated

Being the best PySpark coder. I spent months deep-diving into Spark optimization, reading about AQE, broadcast joins, and RDD internals. Useful knowledge. But the actual bottleneck in most pipelines is not Spark performance — it's data architecture decisions made before a single line of code is written. The schema design, the partitioning strategy, the choice between incremental and full load — those decisions matter more than whether you use repartition or coalesce.

Knowing every tool. Early on I felt pressure to know dbt, Airflow, Kafka, and every new tool that appeared in job descriptions. The tools change. The underlying concepts don't. Understand CDC deeply and you can implement it in ADF, Debezium, or Kafka Connect. Understand the Medallion Architecture and you can implement it in Delta Lake, Iceberg, or Hudi.

Moving fast. Fast pipelines built without proper foundations create technical debt that multiplies. The most expensive code I've ever written was "quick" pipeline code that worked fine for six months and then required three weeks of rework when it failed at scale. Invest in the foundation.

The Tech Lead transition

Becoming a Tech Lead changed what "doing a good job" meant. As an individual contributor, a good job meant clean, working code. As a Tech Lead, a good job meant a team that consistently delivers clean, working code — even when I'm not in the room.

The hardest part is resisting the urge to solve every technical problem yourself. A team that depends on you to unblock every hard problem doesn't scale. Your job shifts from writing the best code to creating the conditions where everyone on the team can write good code: clear standards, useful code reviews, shared patterns, and documentation that exists.

Technical decisions also become communication problems. The right architecture decision is the one the team understands, can implement, and can maintain. A brilliant solution that only you understand is a liability.

What I'd do differently

Learn data contracts earlier. I spent years treating data quality as a downstream problem. Catch bad data in Silver, fix it before Gold. The right model is upstream enforcement: a contract between producer and consumer, with validation before Bronze. This changes the economics of data quality: a violation caught at ingestion costs minutes; the same violation caught in a Gold dashboard costs days.
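A data contract can start as nothing more than a declared schema checked at the door. A minimal sketch, with a made-up contract and record shapes for illustration:

```python
# Hypothetical contract: required fields and their expected types.
CONTRACT = {"order_id": int, "amount": float, "currency": str}

def violations(record: dict) -> list[str]:
    """Validate one record against the producer/consumer contract
    before it is written to Bronze. Returns a list of violations;
    an empty list means the record is admissible."""
    errors = []
    for field, expected in CONTRACT.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"bad type for {field}: {type(record[field]).__name__}")
    return errors

good = {"order_id": 1, "amount": 9.99, "currency": "EUR"}
bad = {"order_id": "1", "amount": 9.99}

assert violations(good) == []
assert violations(bad) == ["bad type for order_id: str", "missing field: currency"]
```

Rejecting or quarantining `bad` at ingestion is the minutes-cost path; letting it flow into Gold and debugging the dashboard is the days-cost path.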

Write tests from day one. I have no good excuse for the years I ran untested pipelines in production. Unit tests for transformation logic, integration tests against a sample dataset, data quality assertions that run on every pipeline execution. The testing investment pays back in the first incident it prevents.
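Unit-testing transformation logic is straightforward once the transformation is a pure function. A sketch of the pattern (the transformation and values are hypothetical; in practice this would live in a pytest file):

```python
# Hypothetical transformation: normalize monetary amounts to integer cents.
def to_cents(rows):
    return [{**r, "amount_cents": round(r["amount"] * 100)} for r in rows]

# Unit test for the pure logic -- no cluster, no production data.
def test_to_cents_basic():
    out = to_cents([{"amount": 1.25}, {"amount": 0.1}])
    assert [r["amount_cents"] for r in out] == [125, 10]

test_to_cents_basic()
```

Keeping transformations as pure functions over plain rows (or DataFrames) is what makes the other two layers, integration tests on a sample dataset and per-run data quality assertions, cheap to add later.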

Document decisions, not just code. Code explains what. Documentation should explain why. Why this partitioning strategy? Why watermark instead of full load? Why this schema design? These decisions made sense when you made them. Six months later, when you're debugging at 2 AM or onboarding a new team member, the why matters as much as the what.

Advice for analysts considering the switch

The transition is worth it if you genuinely enjoy building systems, not just consuming them. If your satisfaction comes from fixing the root cause rather than working around it, from understanding how the infrastructure works, from building something that outlasts any individual analysis — engineering might be the right direction.

It's not worth it if you're chasing a salary bump or a title. The work is different enough that skills you've built as an analyst don't automatically transfer. The SQL knowledge transfers. The business domain knowledge transfers. The "make it work for this one case" mindset does not.

Start building. Don't wait until you feel ready. Rebuild a pipeline you depend on as an analyst, from scratch, with proper engineering standards. That gap between what you have now and what a production-grade pipeline looks like is exactly the gap you need to close.