Principal Software Development Engineer, ML Training and Performance

AMD

Salary not specified
Sep 29, 2025
San Jose, CA, US

AMD is looking to solve the challenge of training generative AI at scale by improving training efficiency and developing new approaches to distributed training of large models across large numbers of GPUs.

Requirements

  • Experience with distributed training pipelines.
  • Knowledge of distributed training algorithms (Data Parallel, Tensor Parallel, Pipeline Parallel, ZeRO).
  • Familiarity with training large models at scale.
  • Experience with ML frameworks such as PyTorch, JAX, or TensorFlow.
  • Experience with distributed training frameworks such as Megatron-LM or DeepSpeed.
  • Experience with LLMs or computer vision, especially large models, is a plus.
  • Excellent Python or C++ programming skills, including debugging, profiling, and performance analysis at scale.

Responsibilities

  • Train large models to convergence on AMD GPUs at scale.
  • Improve the end-to-end training pipeline performance.
  • Optimize the distributed training pipeline and algorithm to scale out.
  • Contribute your changes to open source.
  • Stay up-to-date with the latest training algorithms.
  • Influence the direction of the AMD AI platform.
  • Collaborate with groups and stakeholders across teams.

Other

  • A master's degree or PhD in Computer Science, Artificial Intelligence, Machine Learning, or a related field.
  • San Jose, CA or Bellevue, WA preferred. Other US locations near AMD offices may be considered.
  • Strong communication and problem-solving skills.