Director of Machine Learning Engineering -- Training and Performance

AMD

Salary not specified
Oct 29, 2025
San Jose, CA, United States of America

AMD is seeking a Director of Machine Learning Engineering to define and execute the technical vision for distributed training of large-scale generative AI and recommendation models on AMD GPUs. The role aims to scale AI training efficiency, optimize model performance, and advance AMD's leadership in AI systems.

Requirements

  • Proven experience building and optimizing distributed training systems for large models.
  • Strong familiarity with ML frameworks (PyTorch, JAX, TensorFlow) and distributed frameworks (TorchTitan, Megatron-LM).
  • Hands-on expertise with LLMs, recommendation systems, or ranking models.
  • Proficiency in Python and C++, including performance profiling, debugging, and large-scale optimization.
  • Experience collaborating across hardware, compiler, and system software layers.
  • Preferred: experience in both model-level and application-level development and optimization.

Responsibilities

  • Define and drive AMD’s distributed training strategy for large-scale generative and recommendation models.
  • Architect and optimize distributed training pipelines (pre-training, supervised fine-tuning (SFT), reinforcement learning (RL), etc.) for large-scale models.
  • Explore new approaches for efficient training and inference of LLMs and ranking systems.
  • Lead development of high-performance, reliable training pipelines that scale across thousands of GPUs.
  • Ensure world-class efficiency, stability, and model convergence.
  • Partner with compiler, runtime, system software, and hardware architecture teams to co-design solutions that maximize end-to-end performance.
  • Drive AMD’s engagement in open-source communities through contributions to frameworks such as PyTorch, JAX, TorchTitan, and Megatron-LM.

Other

  • Strategic Leadership & Vision
  • Team Leadership & Development
  • Open Source & External Engagement
  • Research & Trends
  • 10+ years in machine learning, distributed systems, or AI infrastructure; 5+ years in technical leadership or management roles.
  • Excellent communication, leadership, and problem-solving skills with the ability to influence across organizations and external partners.
  • Master’s or Ph.D. in Computer Science, Artificial Intelligence, Machine Learning, or a related field.