AMD is seeking a technical leader to define and execute the technical vision for distributed training of large-scale generative AI and recommendation models on AMD GPUs. The role aims to scale AI training efficiency, optimize model performance, and advance AMD's leadership in AI systems.
Requirements
- Proven experience building and optimizing distributed training systems for large models.
- Strong familiarity with ML frameworks (PyTorch, JAX, TensorFlow) and distributed frameworks (TorchTitan, Megatron-LM).
- Hands-on expertise with LLMs, recommendation systems, or ranking models.
- Proficiency in Python and C++, including performance profiling, debugging, and large-scale optimization.
- Experience collaborating across hardware, compiler, and system software layers.
- Preferred: experience in both model-level and application-level development and optimization.
Responsibilities
- Define and drive AMD’s distributed training strategy for large-scale generative and recommendation models.
- Architect and optimize distributed training pipelines (pre-training, SFT, RL, etc.) for large-scale models.
- Explore new approaches for efficient training and inference of LLMs and ranking systems.
- Lead development of high-performance, reliable training pipelines that scale across thousands of GPUs.
- Ensure world-class efficiency, stability, and model convergence.
- Partner with compiler, runtime, system software, and hardware architecture teams to co-design solutions that maximize end-to-end performance.
- Drive AMD’s engagement in open-source communities through contributions to frameworks such as PyTorch, JAX, TorchTitan, and Megatron-LM.
Other
- Strategic Leadership & Vision
- Team Leadership & Development
- Open Source & External Engagement
- Research & Trends
- 10+ years in machine learning, distributed systems, or AI infrastructure; 5+ years in technical leadership or management roles.
- Excellent communication, leadership, and problem-solving skills with the ability to influence across organizations and external partners.
- Master’s or Ph.D. in Computer Science, Artificial Intelligence, Machine Learning, or a related field.