The ETL Developer role at Citigroup focuses on designing, implementing, and optimizing distributed data processing jobs that handle large-scale data in the Hadoop Distributed File System (HDFS) using Apache Spark and Python, in order to meet specific business needs or user areas.
Requirements
- Experience in systems analysis and programming of software applications
- Working knowledge of consulting/project management techniques/methods
- Deep understanding of data engineering principles
- Proficiency in Python
- Hands-on experience with Spark and Hadoop ecosystems
- Experience in managing and implementing successful projects
- Ability to work under pressure and manage deadlines or unexpected changes in expectations or requirements
Responsibilities
- Conduct tasks related to feasibility studies, time and cost estimates, IT planning, risk technology, applications development, and model development; establish and implement new or revised application systems and programs to meet specific business needs or user areas
- Design and implement Spark applications to process and transform large datasets in HDFS
- Develop ETL pipelines in Spark using Python for data ingestion, cleaning, aggregation, and transformation (a minimal pipeline sketch follows this list)
- Optimize Spark jobs for efficiency, reducing runtime and resource usage
- Fine-tune memory management, caching, and partitioning strategies for optimal performance (a tuning sketch follows this list)
- Load data from different sources into HDFS, ensuring data accuracy and integrity
- Integrate Spark applications with Hadoop ecosystem tools such as Hive and Sqoop
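The sketch below illustrates the kind of ETL pipeline described above: ingesting raw files from HDFS, cleaning and aggregating them with PySpark, and writing the result back as a partitioned Hive table. All paths, column names, and the table name are hypothetical placeholders, not details from the role description.

```python
# Minimal PySpark ETL sketch: ingest, clean, aggregate, write to HDFS/Hive.
# Paths, column names, and the table name below are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("transactions-etl")   # hypothetical job name
    .enableHiveSupport()           # needed to write managed Hive tables
    .getOrCreate()
)

# Ingestion: read raw CSV files from a hypothetical HDFS landing zone
raw = spark.read.csv("hdfs:///data/landing/transactions/", header=True, inferSchema=True)

# Cleaning: drop rows missing key fields, remove duplicates, normalize types
clean = (
    raw.dropna(subset=["account_id", "amount"])
       .dropDuplicates(["transaction_id"])
       .withColumn("amount", F.col("amount").cast("double"))
)

# Aggregation: daily totals per account
daily = (
    clean.groupBy("account_id", F.to_date("transaction_ts").alias("txn_date"))
         .agg(F.sum("amount").alias("total_amount"),
              F.count("*").alias("txn_count"))
)

# Load: write partitioned Parquet back to HDFS and expose it as a Hive table
(
    daily.write.mode("overwrite")
         .partitionBy("txn_date")
         .format("parquet")
         .saveAsTable("analytics.daily_account_totals")
)

spark.stop()
```

A second sketch shows common tuning levers mentioned in the responsibilities (shuffle partitioning, caching, broadcast joins, and output file sizing). Configuration values, DataFrame names, and paths are again illustrative assumptions only.

```python
# Spark tuning sketch: partitioning, caching, and broadcast joins.
# Memory sizes, partition counts, and paths are hypothetical examples.
from pyspark import StorageLevel
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("transactions-etl-tuned")
    .config("spark.sql.shuffle.partitions", "400")   # size shuffles to the cluster
    .config("spark.executor.memory", "8g")           # example executor memory setting
    .getOrCreate()
)

transactions = spark.read.parquet("hdfs:///data/curated/transactions/")  # large fact table
accounts = spark.read.parquet("hdfs:///data/curated/accounts/")          # small dimension table

# Broadcast the small dimension table to avoid a shuffle-heavy join
enriched = transactions.join(F.broadcast(accounts), on="account_id", how="left")

# Cache a DataFrame reused by several downstream aggregations;
# MEMORY_AND_DISK spills to disk rather than recomputing when memory is tight
enriched.persist(StorageLevel.MEMORY_AND_DISK)

by_branch = enriched.groupBy("branch_id").agg(F.sum("amount").alias("total"))
by_product = enriched.groupBy("product_id").agg(F.count("*").alias("txns"))

# Coalesce before writing to avoid producing thousands of tiny HDFS files
by_branch.coalesce(32).write.mode("overwrite").parquet("hdfs:///data/marts/by_branch/")
by_product.coalesce(32).write.mode("overwrite").parquet("hdfs:///data/marts/by_product/")

enriched.unpersist()
spark.stop()
```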
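Both sketches assume a Hadoop cluster where Hive is available through the Spark session; in practice, source systems, schemas, and cluster sizing would follow Citi's own environment and standards.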
Other
- 5-8 years of relevant experience
- Bachelor’s degree/University degree or equivalent experience
- Ability to operate with a limited level of direct supervision
- Ability to exercise independence of judgement and autonomy