At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems.
Requirements
- Languages: Python, C, C++, Linux Shell scripting.
- Frameworks/Libraries: TensorFlow, PyTorch, ONNX Runtime.
- Tools: Prior experience with Linux, Docker, Kubernetes, SLURM, and LLVM compilers.
- Strong experience with complex computer systems used in AI and HPC deployments, including backend network designs in RDMA clusters.
- Experience validating complex AI infrastructure (GPUs, networking, RoCEv2, UEC) and running benchmarks such as IBPerf and RCCL/NCCL tests.
- Experience with performance profiling of CPUs and GPUs, and with debugging complex compute, network, and storage problems.
Responsibilities
- Work with AMD's architecture specialists to validate AI solutions for distributed training and inference workloads with AMD's ROCm software.
- Build cluster-scale automation for distributed training and inference workloads.
- Reproduce field defects and develop appropriate tests to prevent future issues.
- Design, develop, and deploy the testing tools and automation libraries needed for testing.
- Lead the adoption of tooling and industry best practices through advocacy and outreach, helping our development communities level up.
- Other duties as assigned.
Other
- Bachelor's Degree or higher in Computer Science or related quantitative field.
- An advanced degree or equivalent practical work experience is a plus.
- This role is not eligible for visa sponsorship.
- Able to communicate effectively and collaborate well with different teams across AMD.
- Leadership skills to drive complex issues to resolution.