Amazon Web Services (AWS) is looking for a Software Development Engineer II to build, deliver, and maintain complex products that delight customers and raise performance bars. The role involves designing fault-tolerant systems that run at massive scale to innovate best-in-class services and applications in the AWS Cloud, specifically focusing on the AWS Neuron software stack for machine learning accelerators.
Requirements
- Experience training large ML models using Python is a must
- FSDP, DeepSpeed, and other distributed training libraries are central to this role, and extending them to Neuron-based systems is key
- Strong software development and ML knowledge are both critical to this role
- 5+ years of experience programming in at least one programming language
- 5+ years of experience leading the design or architecture (design patterns, reliability, and scaling) of new and existing systems
- 5+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
Responsibilities
- Build distributed training support into PyTorch and TensorFlow using XLA and the Neuron compiler and runtime stacks
- Tune models for the highest performance and maximum efficiency on customer AWS Trainium and Inferentia silicon and on Trn1 and Inf1 servers
- Develop, enable, and performance-tune a wide variety of ML model families, including massive-scale large language models such as GPT-2, GPT-3, and beyond, as well as Stable Diffusion, Vision Transformers, and many more
- Create, build, and tune distributed training solutions with Trn1
- Design fault-tolerant systems that run at massive scale
- Decompose problems to develop products that impact millions of people around the world
- Identify, define, and build software solutions that revolutionize how businesses operate
Other
- Experience as a mentor, tech lead or leading an engineering team
- Work safely and cooperatively with other employees, supervisors, and staff
- Adhere to standards of excellence despite stressful conditions
- Communicate effectively and respectfully with employees, supervisors, and staff to ensure exceptional customer service
- Follow all federal, state, and local laws and Company policies