AMD is looking for a Principal Machine Learning Engineer to join its Models and Applications team. The role addresses the challenge of distributed training of large models across large numbers of GPUs: improving training efficiency while innovating and generating new ideas for training generative AI at scale.
Requirements
- Experience building distributed training pipelines.
- Knowledge of distributed training algorithms (Data Parallel, Tensor Parallel, Pipeline Parallel, ZeRO).
- Familiarity with training large models at scale.
- Experience with ML frameworks such as PyTorch, JAX, or TensorFlow.
- Experience with distributed training frameworks such as Megatron-LM and DeepSpeed.
- Experience with LLMs or computer vision, especially large models, is a plus.
- Excellent Python or C++ programming skills, including debugging, profiling, and performance analysis at scale.
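To make the parallelism terminology above concrete, here is a minimal, framework-free sketch of the core idea behind data-parallel training: each worker computes gradients on its own data shard, the gradients are averaged (as an all-reduce would do), and every worker applies the same update. All function names and the toy model are illustrative, not AMD or PyTorch APIs.

```python
# Data-parallel training sketch: a 1-parameter linear model y = w * x.
# Each "worker" holds a shard of data; gradients are averaged across
# workers before the shared update, mirroring an all-reduce.

def local_gradient(weights, batch):
    # Toy gradient of mean squared error for y = w * x on this shard.
    w = weights[0]
    return [sum(2 * (w * x - y) * x for x, y in batch) / len(batch)]

def all_reduce_mean(grads_per_worker):
    # Average each gradient component across workers.
    n = len(grads_per_worker)
    return [sum(g[i] for g in grads_per_worker) / n
            for i in range(len(grads_per_worker[0]))]

def data_parallel_step(weights, shards, lr=0.1):
    grads = [local_gradient(weights, shard) for shard in shards]
    avg = all_reduce_mean(grads)  # every worker sees the same update
    return [w - lr * g for w, g in zip(weights, avg)]

# Two workers, each with a shard of data drawn from y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = [0.0]
for _ in range(50):
    w = data_parallel_step(w, shards)
```

In a real framework such as PyTorch DDP, the averaging step is a collective `all_reduce` over the GPU interconnect rather than a Python loop, and ZeRO additionally shards optimizer state across workers to cut memory.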
Responsibilities
- Train large models to convergence on AMD GPUs at scale.
- Improve the end-to-end training pipeline performance.
- Optimize distributed training pipelines and algorithms to scale out.
- Contribute changes back to open-source projects.
- Stay up to date with the latest training algorithms.
- Influence the direction of AMD's AI platform.
- Collaborate with teams, groups, and stakeholders across the company.
Other
- A master's or PhD degree in Computer Science, Artificial Intelligence, Machine Learning, or a related field.
- San Jose, CA or Bellevue, WA preferred; other US markets near AMD US offices may be considered.
- Strong communication and problem-solving skills.