AWS Neuron aims to solve the problem of developing, enabling, and performance-tuning a wide variety of ML model families, including massive-scale large language models, on the AWS Inferentia and Trainium cloud-scale machine learning accelerators.
Requirements
- Experience programming in at least one programming language
- Experience training large models using Python
- Familiarity with distributed training libraries such as FSDP and DeepSpeed
- Experience with PyTorch and JAX
- Experience with XLA and the Neuron compiler and runtime stacks
- Strong software development and ML knowledge
- Experience with design patterns, reliability and scaling of new and existing systems
Responsibilities
- Development, enablement and performance tuning of a wide variety of ML model families
- Building distributed training support into PyTorch and JAX using XLA and the Neuron compiler and runtime stacks
- Tuning models to achieve the highest performance and maximize their efficiency when running on customers' AWS Trainium instances
- Leading efforts to create, build, and tune distributed training solutions with Trainium
- Extending distributed training libraries such as FSDP and DeepSpeed to Neuron-based systems
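The distributed-training responsibilities above center on techniques like data parallelism, where each worker computes gradients on its own data shard and the results are averaged across workers. As a minimal, framework-free sketch (plain Python, purely illustrative; real systems such as FSDP and DeepSpeed perform this step with collective communication primitives on accelerators like Trainium):

```python
# Illustrative sketch of the gradient-averaging (all-reduce) step at the
# heart of data-parallel distributed training. Names here are hypothetical;
# production libraries implement this with hardware collective operations.

def allreduce_mean(worker_grads):
    """Average per-worker gradient vectors element-wise,
    as an all-reduce-mean collective would."""
    n = len(worker_grads)
    return [sum(component) / n for component in zip(*worker_grads)]

# Each of three workers computes gradients on its own data shard...
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# ...then every worker applies the same averaged gradient,
# keeping model replicas in sync.
avg = allreduce_mean(grads)  # [3.0, 4.0]
```

Sharded approaches such as FSDP extend this idea by also partitioning parameters and optimizer state across workers, trading extra communication for a much smaller per-device memory footprint.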
Other
- 3+ years of non-internship professional software development experience
- 2+ years of non-internship experience designing or architecting new and existing systems
- Bachelor's degree in computer science or equivalent
- Ability to work safely and cooperatively with other employees, supervisors, and staff
- Ability to communicate effectively and respectfully with employees, supervisors, and staff