AMD is looking for a Principal Machine Learning Engineer to join its Models and Applications team. The role addresses the challenge of distributed training of large models across large numbers of GPUs: improving training efficiency while innovating and generating new ideas for training generative AI at scale.
Requirements
- Experience building distributed training pipelines.
- Knowledge of distributed training algorithms (Data Parallel, Tensor Parallel, Pipeline Parallel, ZeRO).
- Familiarity with training large models at scale.
- Experience with ML frameworks such as PyTorch, JAX, or TensorFlow.
- Experience with distributed training frameworks such as Megatron-LM and DeepSpeed.
- Experience with LLMs or computer vision, especially large models, is a plus.
- Excellent Python or C++ programming skills, including debugging, profiling, and performance analysis at scale.
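To make the parallelism terminology above concrete, here is a minimal, framework-free sketch of the core idea behind data-parallel training: each worker computes gradients on its own data shard, the gradients are averaged (as an all-reduce would do), and every worker applies the same update. All function names and the toy model are illustrative, not AMD or PyTorch APIs.

```python
# Data-parallel training sketch: a 1-parameter linear model y = w * x.
# Each "worker" holds a shard of data; gradients are averaged across
# workers before the shared update, mirroring an all-reduce.

def local_gradient(weights, batch):
    # Toy gradient of mean squared error for y = w * x on this shard.
    w = weights[0]
    return [sum(2 * (w * x - y) * x for x, y in batch) / len(batch)]

def all_reduce_mean(grads_per_worker):
    # Average each gradient component across workers.
    n = len(grads_per_worker)
    return [sum(g[i] for g in grads_per_worker) / n
            for i in range(len(grads_per_worker[0]))]

def data_parallel_step(weights, shards, lr=0.1):
    grads = [local_gradient(weights, shard) for shard in shards]
    avg = all_reduce_mean(grads)  # every worker sees the same update
    return [w - lr * g for w, g in zip(weights, avg)]

# Two workers, each with a shard of data drawn from y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = [0.0]
for _ in range(50):
    w = data_parallel_step(w, shards)
```

In a real framework such as PyTorch DDP, the averaging step is a collective `all_reduce` over the GPU interconnect rather than a Python loop, and ZeRO additionally shards optimizer state across workers to cut memory.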
Responsibilities
- Train large models to convergence on AMD GPUs at scale.
- Improve the end-to-end training pipeline performance.
- Optimize distributed training pipelines and algorithms to scale out.
- Contribute changes back to open-source projects.
- Stay up to date with the latest training algorithms.
- Influence the direction of AMD's AI platform.
- Collaborate with teams, groups, and stakeholders across the company.
Other
- A master's or PhD degree in Computer Science, Artificial Intelligence, Machine Learning, or a related field.
- San Jose, CA or Bellevue, WA preferred; other US markets near AMD US offices may be considered.
- Strong communication and problem-solving skills.