Design, build, and operationalize scalable data processing and ML-ready pipelines.
Requirements
- AWS services (EMR, SageMaker, Lambda, Redshift, Glue, SNS, SQS)
- PySpark and data processing frameworks
- Shell scripting and Python development
- CI/CD tooling experience (Jenkins, UCD)
- Source control experience with Bitbucket and GitHub
- Experience building and maintaining scripts/tools for automation
- Familiarity with AWS ECS
- Experience with Aurora PostgreSQL
- Java development for tooling or pipeline components
Responsibilities
- Design and implement scalable ETL/ELT pipelines on AWS for batch and near-real-time workloads.
- Build and optimize data processing jobs using PySpark on EMR and Glue.
- Develop and manage Redshift schemas and queries, and use Redshift Spectrum for external table access.
- Integrate machine learning workflows with SageMaker and Lambda-driven orchestration.
- Automate deployments and testing using CI/CD tools and source control (Jenkins, UCD, Bitbucket, GitHub).
- Create and maintain operational scripts and tooling (Shell, Python) for monitoring, troubleshooting, and performance tuning.
Other
- Onsite from day 1, 5 days a week