AMD is looking for an influential senior software engineer who is passionate about improving the performance of key applications and benchmarks.
Requirements
- Experience with distributed training pipelines.
- Knowledgeable in distributed training algorithms (Data Parallel, Tensor Parallel, Pipeline Parallel, ZeRO).
- Familiar with training large models.
- Experience with ML frameworks such as PyTorch, JAX, or TensorFlow.
- Experience with distributed training frameworks such as DeepSpeed or Megatron-LM.
- Experience with LLMs, recommendation, or computer vision, especially large models, is a plus.
- Excellent Python programming skills, including debugging, profiling, and performance analysis.
Responsibilities
- Train large models to convergence on AMD GPUs.
- Improve the end-to-end training pipeline performance.
- Optimize the distributed training pipeline and algorithm to scale out.
- Contribute your changes to open source.
- Stay up-to-date with the latest training algorithms.
- Influence the direction of the AMD AI platform.
- Collaborate with other teams and stakeholders.
Other
- A Bachelor's, Master's, or Ph.D. degree in Computer Science, Artificial Intelligence, Machine Learning, or a related field.
- Strong communication and problem-solving skills.