AWS Neuron aims to solve the development, enablement, and performance tuning of a wide variety of ML model families, including massive-scale large language models, on the AWS Inferentia and Trainium cloud-scale machine learning accelerators and the Trn1 and Inf1 servers.
Requirements
- Experience programming in at least one programming language
- Experience training large models using Python
- Experience with FSDP, DeepSpeed, and other distributed training libraries
- Experience with PyTorch, TensorFlow, and JAX using XLA
- Experience with the Neuron compiler and runtime stacks
- Strong software development and ML knowledge
- Experience with design patterns, reliability and scaling of new and existing systems
Responsibilities
- Help lead efforts to build distributed training and inference support into PyTorch, TensorFlow, and JAX using XLA and the Neuron compiler and runtime stacks
- Tune these models to ensure the highest performance and maximum efficiency running on customer AWS Trainium and Inferentia silicon and the Trn1 and Inf1 servers
- Develop, enable, and performance-tune a wide variety of ML model families, including massive-scale large language models such as GPT-2, GPT-3, and beyond
- Create, build, and tune distributed training solutions on Trn1
- Extend distributed training libraries such as FSDP and DeepSpeed for Neuron-based systems
- Work side by side with chip architects, compiler engineers, and runtime engineers to create and tune distributed training solutions
- Ensure the highest performance and maximize the efficiency of ML models running on customer AWS Trainium and Inferentia silicon and the Trn1 and Inf1 servers
Other
- 3+ years of non-internship professional software development experience
- 2+ years of non-internship design or architecture of new and existing systems experience
- Bachelor's degree in computer science or equivalent
- Ability to work collaboratively in a team
- Strong communication and problem-solving skills