The Amazon AGI SF Lab is focused on developing new foundational capabilities for enabling useful AI agents that can take actions in the digital and physical worlds.
Requirements
- Experience programming in Java, C++, Python, or a related language
- Experience with deep learning methods and machine learning
- Experience debugging ML systems
- PhD in Computer Science, Machine Learning, or a related field, with a focus on ML systems.
- Demonstrated experience in developing, implementing, and debugging large-scale ML systems.
- Experience with distributed systems, Megatron, vLLM, Ray, and working with GPUs.
Responsibilities
- Develop cutting-edge training infrastructure to ensure large-scale reinforcement learning on LLMs runs efficiently and robustly.
- Work across the entire technology stack, including low-level ML systems, job orchestration, and data management.
- Analyze, troubleshoot, and profile complex ML systems to identify and address performance bottlenecks.
- Work closely with researchers and conduct MLSys research to create new techniques, infrastructure, and tooling around emerging research capabilities.
Other
- PhD, or Master's degree and 3+ years of applied research experience
- Work safely and cooperatively with other employees, supervisors, and staff;
- Adhere to standards of excellence despite stressful conditions;
- Communicate effectively and respectfully with employees, supervisors, and staff to ensure exceptional customer service;
- Follow all federal, state, and local laws and Company policies.