Senior PySpark ETL Engineer

The Senior PySpark ETL Engineer is responsible for designing, building, optimizing, and operating scalable data pipelines using Apache Spark (PySpark).

This role focuses on high-volume batch (and optionally streaming) data processing, ensuring performance, reliability, data quality, and cost efficiency across enterprise data platforms.

The position requires strong Python skills, hands-on Spark expertise, deep SQL and data modeling knowledge, and the ability to own pipelines end-to-end in production.

Mandatory Requirements :

– 10 to 12 years of overall IT experience, with a strong focus on data engineering and ETL.

– 3+ years of hands-on experience with PySpark / Apache Spark in production environments.

– Strong experience designing and implementing ETL / ELT pipelines at scale.

– Excellent knowledge of SQL and relational data concepts.

– Experience handling large datasets in distributed environments.

– Strong ownership mindset, problem-solving skills, and the ability to handle production pipelines independently.

Core Technical Skills :

PySpark & Spark Engineering :

– Deep expertise in PySpark :

1. DataFrames, Spark SQL, window functions, joins, aggregations

2. Spark execution model (DAGs, stages, tasks)

– Strong hands-on experience with :

1. Partitioning strategies

2. Shuffle optimization

3. Broadcast vs sort-merge joins

4. Caching / persisting

5. Handling data skew and memory spills

– Proven ability to debug and optimize slow Spark jobs.

ETL & Data Engineering :

– Strong knowledge of ETL/ELT design patterns :

1. Incremental loads

2. Watermarking

3. Idempotent pipeline design

4. Reprocessing and backfill strategies

– Experience implementing :

1. SCD Type 1 / Type 2

2. Deduplication and late-arriving data handling

– Ability to design reusable transformation frameworks and common utilities.

– Experience building source-to-target reconciliation and data quality checks.
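
For candidates who want a concrete picture of the incremental-load, watermarking, and idempotency patterns listed above, here is a minimal sketch. It is written in plain Python rather than PySpark so that it runs without a cluster; the function name `incremental_load`, the row fields `id`, `amount`, and `updated_at`, and the in-memory `target` dictionary are all hypothetical and stand in for a real source table and sink.

```python
from datetime import datetime


def incremental_load(source_rows, target, watermark):
    """Upsert rows newer than `watermark` into `target`; return the new watermark.

    `target` is a dict keyed by record id (a stand-in for a real sink table).
    """
    # Incremental load: only pick up rows changed after the last watermark.
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    for row in new_rows:
        current = target.get(row["id"])
        # Idempotent upsert: keep the newest version per key, so replaying
        # the same input leaves the target in the same state.
        if current is None or row["updated_at"] >= current["updated_at"]:
            target[row["id"]] = row
    # Advance the watermark to the newest change seen, or keep it if nothing new.
    return max((r["updated_at"] for r in new_rows), default=watermark)


# Hypothetical sample data.
source = [
    {"id": 1, "amount": 10, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "amount": 20, "updated_at": datetime(2024, 1, 2)},
    {"id": 1, "amount": 15, "updated_at": datetime(2024, 1, 3)},
]
target = {}
wm = incremental_load(source, target, datetime.min)  # first full run
wm_again = incremental_load(source, target, wm)      # nothing new to load
```

Replaying the full input from `datetime.min` (a backfill) leaves `target` unchanged, which is the property a rerunnable pipeline relies on.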

Data Storage & SQL :

– Excellent SQL skills including :

1. Complex joins

2. Subqueries and CTEs

3. Window functions

4. Query optimization
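
As an illustration of the SQL level expected, the sketch below combines a CTE with a window function to keep only the latest row per key, a common deduplication query. It uses Python's built-in `sqlite3` module purely so that it is self-contained (window functions need SQLite 3.25 or newer); the `events` table and its columns are hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical events table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, status TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "created", "2024-01-01"),
        (1, "shipped", "2024-01-03"),
        (2, "created", "2024-01-02"),
    ],
)

# CTE + ROW_NUMBER(): rank each id's rows newest first, then keep rank 1.
latest = conn.execute(
    """
    WITH ranked AS (
        SELECT id, status, updated_at,
               ROW_NUMBER() OVER (
                   PARTITION BY id ORDER BY updated_at DESC
               ) AS rn
        FROM events
    )
    SELECT id, status, updated_at
    FROM ranked
    WHERE rn = 1
    ORDER BY id
    """
).fetchall()
```

The same `PARTITION BY ... ORDER BY ...` shape carries over to Spark SQL, which is why window functions appear under both the PySpark and SQL skill lists.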

– Experience working with :

1. RDBMS sources (Postgres, MySQL)

2. Data lake storage using Parquet / ORC

– Experience with partitioned datasets and compaction strategies.

Cloud & Big Data Platforms :

– Hands-on experience with at least one Spark platform :

1. AWS EMR

2. Spark on Kubernetes

– Experience working with cloud storage :

1. S3

– Familiarity with orchestration tools :

1. Airflow, Databricks Workflows, ADF, or equivalent.

Responsibilities :

Pipeline Development & Ownership :

– Design, implement, and maintain high-performance PySpark ETL pipelines.

– Own pipelines end-to-end, including development, deployment, monitoring, and production support.

– Ensure pipelines are scalable, fault-tolerant, and rerunnable.

– Implement incremental processing and efficient data movement strategies.

Performance & Reliability :

– Identify and fix Spark performance bottlenecks.

– Optimize resource usage and reduce execution time and cost.

– Handle production issues related to :

1. Job failures

2. Data corruption

3. SLA breaches

– Perform root cause analysis and implement permanent fixes.

Data Quality & Governance :

– Implement strong data quality validations, checks, and reconciliation mechanisms.

– Ensure correctness, completeness, and freshness of datasets.

– Follow enterprise standards for :

1. Data retention

2. Auditability

3. Schema evolution
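
To give a sense of the reconciliation checks mentioned under Data Quality & Governance, the following is a minimal sketch in plain Python, not PySpark. It compares a source and a target extract on row count, a summed measure, and key coverage. The function name `reconcile`, the `id` and `amount` fields, and the sample rows are hypothetical; a production check would typically run as aggregations on the platform itself.

```python
def reconcile(source_rows, target_rows, key="id", measure="amount"):
    """Return a small source-to-target reconciliation report."""
    source_keys = {r[key] for r in source_rows}
    target_keys = {r[key] for r in target_rows}
    return {
        # Completeness: same number of rows on both sides.
        "row_count_match": len(source_rows) == len(target_rows),
        # Correctness: the control total of the measure agrees.
        "measure_sum_match": (
            sum(r[measure] for r in source_rows)
            == sum(r[measure] for r in target_rows)
        ),
        # Key coverage in both directions.
        "missing_in_target": sorted(source_keys - target_keys),
        "unexpected_in_target": sorted(target_keys - source_keys),
    }


# Hypothetical extracts: the target is missing id 3.
source = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
    {"id": 3, "amount": 30},
]
target = [
    {"id": 1, "amount": 10},
    {"id": 2, "amount": 20},
]
report = reconcile(source, target)
```

A report like this can be logged per run to support auditability and to flag a load before downstream consumers read it.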
