Amazon's Machine Learning training infrastructure (ML Infra) team is looking to design, implement, and optimize large-scale computing infrastructure to power cutting-edge AI and machine learning initiatives.
Requirements
- 8+ years of professional software development experience in distributed systems with emphasis on ML infrastructure
- 8+ years of current programming experience building ML infrastructure using languages such as Python, C++ or Rust
- Hands-on experience with parallel computing platforms such as CUDA and OpenMP
- Deep understanding of AI frameworks such as PyTorch, TensorFlow, and JAX, and of the demands they place on the underlying compute infrastructure, memory bandwidth, network interconnect, and storage as scale increases
- Knowledge of emerging AI hardware accelerators and architectures
- Experience with containerization and orchestration technologies (Docker, Kubernetes)
- Experience with cloud computing platforms (AWS, Azure, GCP) and their offerings
Responsibilities
- Lead the definition, design, architecture quality, implementation, and delivery of solutions to the most advanced, most difficult, most cross-cutting, and/or most ambiguous challenges spanning our ML infrastructure.
- Align the teams in ML Infrastructure and related organizations to a coherent technical vision and deliver systems that fit well together.
- Exert influence over multiple teams, increasing their productivity and effectiveness.
- Serve as an authority on technical issues for both the technical and research communities, guiding difficult trade-off decisions and driving awareness of the impact and consequences of technical decisions on AI research and product development.
- Demonstrate significant innovation, creativity, and judgement when solving challenging AI/ML infrastructure problems.
- Actively mentor senior and Principal engineers, and scale yourself by developing and institutionalizing best practices in AI/ML infrastructure and distributed computing across the organization.
Other
- 5+ years of non-internship professional software development experience
- 5+ years of experience leading the design or architecture (design patterns, reliability, and scaling) of new and existing systems
- Experience as a mentor, tech lead or leading an engineering team
- Bachelor's degree in computer science or equivalent
- 5+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations