This role builds, manages, and scales the infrastructure that supports machine learning workflows and cloud-based applications at Citi.
Requirements
- Proven experience in building and managing ECS clusters
- Strong understanding of CI/CD processes and tools (e.g., LightSpeed, OpenShift, Harness)
- Proficiency in Docker for containerization
- Working knowledge of SQL and relational databases (e.g., PostgreSQL, MySQL)
- Experience with AWS S3 for data management
- Solid Python skills for automation and scripting
Responsibilities
- Build and manage ECS clusters for production and experimental AI/ML environments
- Design, implement, and maintain CI/CD pipelines using tools such as LightSpeed, OpenShift, and Harness
- Containerize ML models and applications using Docker
- Support LLM-based validation pipelines, including setting up infrastructure for evaluation jobs and integrating results into monitoring dashboards
- Manage AWS S3 for datasets, model artifacts, and pipeline outputs, ensuring secure and efficient data access
- Proactively troubleshoot systems, tune their performance, and scale them as needed
Other
- Bachelor’s degree in Computer Science, Engineering, or a related field
- 3+ years of experience in DevOps, MLOps, or Platform Engineering, supporting machine learning workflows and deploying cloud-based applications