At AMD, our mission is to accelerate next-generation computing experiences, from AI and data centers to PCs, gaming, and embedded systems, by building great products and solving the world's most important challenges.
Requirements
- Strong experience with complex compute systems used in AI and HPC deployments, including backend network designs for RDMA clusters
- Experience validating complex AI infrastructure (GPUs, networking, RoCEv2, UEC) and running benchmarks such as IB perftest, RCCL, and NCCL tests
- Experience running training of LLMs, MoE models, image-generation models, and recommendation models with frameworks such as PyTorch, TensorFlow, Megatron-LM, and JAX
- Experience running inference workloads in AI clusters with inference frameworks such as vLLM and SGLang
- Experience with distributed systems and schedulers such as Kubernetes and Slurm
- Ability to write high-quality automation frameworks and scripts using Python or Golang
- Experience with performance profiling of CPUs and GPUs, and debugging complex compute, network, and storage problems
Responsibilities
- Work with AMD’s architecture specialists to validate AI solutions for distributed training and inference workloads with AMD's ROCm software
- Build cluster-scale automation for distributed training and inference workloads
- Publish reference designs and benchmark numbers for AI workloads
- Apply a data-driven approach to target optimization efforts
- Design and develop new groundbreaking AMD technologies
- Participate in new ASIC and hardware bring-ups
- Develop technical relationships with peers and partners
Other
- Bachelor’s or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent
- Effective communication and problem-solving skills
- Leadership skills to drive sophisticated issues to resolution
- Ability to collaborate effectively with different teams across AMD