Principal Software Development Engineer, ML Training and Performance

AMD

Salary not specified
Sep 4, 2025
San Jose, CA, US

AMD is looking to improve the performance of key applications and benchmarks by training large models on AMD GPUs and optimizing distributed training pipelines.

Requirements

  • Experience with distributed training pipelines.
  • Knowledge of distributed training algorithms (Data Parallel, Tensor Parallel, Pipeline Parallel, ZeRO).
  • Familiarity with training large models.
  • Experience with ML frameworks such as PyTorch, JAX, or TensorFlow.
  • Experience with distributed training frameworks such as DeepSpeed or Megatron-LM.
  • Experience with LLMs, recommendation, or computer vision, especially large models, is a plus.
  • Excellent Python programming skills, including debugging, profiling, and performance analysis.

Responsibilities

  • Train large models to convergence on AMD GPUs.
  • Improve the end-to-end training pipeline performance.
  • Optimize the distributed training pipeline and algorithm to scale out.
  • Contribute your changes to open source.
  • Stay up-to-date with the latest training algorithms.
  • Influence the direction of the AMD AI platform.

Other

  • Collaborate with other teams and stakeholders.
  • Strong communication and problem-solving skills.
  • A Bachelor's, Master's, or Ph.D. degree in Computer Science, Artificial Intelligence, Machine Learning, or a related field.
  • San Jose, CA (hybrid)