Annapurna Labs, an integral part of AWS, is seeking an experienced engineer to work on distributed AI/ML systems. The role focuses on collective operations, which enable AI to scale across multiple accelerators and servers, and supports the development of hardware and software components critical to EC2 infrastructure.
Requirements
- Solid knowledge of Linux and kernels, and experience writing performant code
- Experience with embedded systems
- Experience with high-speed networking or HPC interconnects
- Programming experience in at least one language, preferably C/C++
- Experience with design or architecture of new and existing systems
- Experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
- Experience with cloud computing, preferably with AWS
Responsibilities
- Working on collective operations, the fundamental operations that enable AI to scale across multiple accelerators and servers
- Developing software components for EC2 infrastructure
- Building networking solutions for Machine Learning (ML) and High-Performance Computing (HPC) workloads on AWS
- Collaborating with infrastructure experts, hardware engineers, RTL engineers, scientists, and architects
- Working on features for the largest clusters, with the largest customers, for the largest AI models
- Developing and maintaining performant code, drawing on solid knowledge of Linux, kernels, and embedded systems
- Iterating fast and delivering meaningful solutions at scale
Other
- 3+ years of non-internship professional software development experience
- 2+ years of non-internship design or architecture experience
- Bachelor's degree in computer science or equivalent
- Ability to work safely and cooperatively with other employees, supervisors, and staff
- Ability to communicate effectively and respectfully with employees, supervisors, and staff