AWS Neuron develops, enables and performance-tunes building blocks for all key ML model families, including Llama3, GPT OSS, Qwen3, DeepSeek and beyond, delivering high-performance distributed inference solutions for the latest generation of Trainium accelerators.
Requirements
- Experience optimizing LLM inference performance with kernels, Python, PyTorch or JAX
- Experience programming with at least one programming language
- Experience with design or architecture (design patterns, reliability and scaling) of new and existing systems
- Experience with full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
- Experience with compiler and runtime engineering
- Experience with distributed inference solutions
- Experience with machine learning accelerators
Responsibilities
- Develop optimized building blocks for the Neuron distributed inference library, tuning them for maximum performance and efficiency on Trn2 and Trn3 servers
- Create metrics, implement automation and other improvements, and identify and resolve the root causes of software defects
- Participate in design discussions and code reviews, and communicate with internal and external stakeholders
- Work cross-functionally with teams across Neuron in a fast-paced, startup-like development environment
- Develop new technology components and optimize LLM inference performance with kernels, Python, PyTorch or JAX
- Build and tune high-performance distributed inference solutions for the latest-generation Trainium accelerators
Other
- 3+ years of non-internship professional software development experience
- 2+ years of non-internship design or architecture experience
- Bachelor's degree in computer science or equivalent
- Ability to work in a fast-paced, startup-like development environment
- Ability to communicate with internal and external stakeholders