


Software Engineer - AI/ML, AWS Neuron Distributed Training

Amazon Web Services

$151,300 - $261,500
Oct 28, 2025
Cupertino, CA, US

Amazon Web Services (AWS) is looking for a Software Development Engineer II to build, deliver, and maintain complex products that delight customers and raise performance bars. The role involves designing fault-tolerant systems that run at massive scale to innovate best-in-class services and applications in the AWS Cloud, specifically focusing on the AWS Neuron software stack for machine learning accelerators.

Requirements

  • Experience training large-scale ML models using Python is a must
  • Familiarity with FSDP, DeepSpeed, and other distributed training libraries is central to the role, and extending them for Neuron-based systems is key
  • Strong software development and ML knowledge are both critical to this role
  • 5+ years of programming experience with at least one software programming language
  • 5+ years of experience leading design or architecture (design patterns, reliability, and scaling) of new and existing systems
  • 5+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations

Responsibilities

  • Build distributed training support into PyTorch and TensorFlow using XLA and the Neuron compiler and runtime stacks
  • Tune models to ensure the highest performance and maximize their efficiency on the custom AWS Trainium and Inferentia silicon and the Trn1 and Inf1 servers
  • Develop, enable, and performance-tune a wide variety of ML model families, including massive-scale large language models like GPT-2, GPT-3, and beyond, as well as Stable Diffusion, Vision Transformers, and many more
  • Create, build, and tune distributed training solutions with Trn1
  • Design fault-tolerant systems that run at massive scale
  • Decompose problems to develop products that impact millions of people around the world
  • Identify, define, and build software solutions that revolutionize how businesses operate

Other

  • Experience as a mentor or tech lead, or leading an engineering team
  • Work safely and cooperatively with other employees, supervisors, and staff
  • Adhere to standards of excellence despite stressful conditions
  • Communicate effectively and respectfully with employees, supervisors, and staff to ensure exceptional customer service
  • Follow all federal, state, and local laws and Company policies