Job Board

Get Jobs Tailored to Your Resume

Filtr uses AI to scan 1000+ jobs and find postings that match your resume

Software Development Manager, AWS Neuron Machine Learning Distributed Training - Model Enablement

Amazon Web Services

$166,400 - $287,700
Dec 12, 2025
Cupertino, CA, US

AWS Neuron designs and deploys new products for machine learning accelerators and servers, specifically the AWS Inferentia and Trainium cloud-scale machine learning accelerators and the Trn1 and Inf1 servers that use them.

Requirements

  • 3+ years of deep learning/machine learning experience
  • Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and livesite operations
  • Experience with PyTorch, XLA, JAX, and distributed training libraries such as FSDP and DDP
  • Experience with designing or architecting (design patterns, reliability and scaling) of new and existing systems
  • Experience leading the definition and development of multi-tier web services
  • Experience partnering with product or program management teams
  • Experience in communicating with users, other technical teams, and senior leadership to collect requirements, describe software product features, technical designs, and product strategy

Responsibilities

  • Solve challenging technical problems, often ones not solved before, at every layer of the stack.
  • Design, implement, test, deploy and maintain innovative software solutions to transform service performance, durability, cost, and security.
  • Build high-quality, highly available, always-on products.
  • Research implementations that deliver the best possible experiences for customers.
  • Lead the way to ensure support for key ML functionality in a combined chip/software platform.
  • Ensure the right thing is being built and delivered to customers.
  • Own the full development life cycle of our integrations and extensions for inference and training support in PyTorch, XLA, and JAX, as well as distributed training libraries such as FSDP, DDP, and others.

Other

  • 3+ years of engineering team management experience
  • 7+ years of experience working directly within engineering teams
  • 8+ years of experience leading the definition and development of multi-tier web services
  • Experience recruiting, hiring, mentoring/coaching and managing teams of Software Engineers
  • Ability to work safely and cooperatively with other employees, supervisors, and staff