AMD is looking to validate multi-node GPU communication HW/FW and SW libraries to enable high-performance computing and machine learning workloads.
Requirements
- Strong background developing applications and libraries in C, C++, and Python
- Experience working with RoCE (RDMA over Converged Ethernet), Libfabric, and InfiniBand, among others
- Experience working with the Linux kernel, device drivers, and network drivers.
- Experience designing and building GPU Networks for Large Scale Clusters
- Experience with collective communication libraries (MPI, RCCL, SHMEM) and with optimizing collective communication to scale across distributed systems.
- GPU software development using HIP, CUDA, or OpenCL
- Understanding of CPU, GPU and NIC architecture and low-level optimization techniques including assembly programming and/or vectorization
Responsibilities
- Act as the liaison between the AMD engineers who work on the GPU, CPU, and NIC components to help integrate solutions at the server, rack, and cluster levels.
- Support validation of servers with AMD CPU/GPU/NICs and AMD’s libraries such as RCCL
- Design, implement, and test networking features for multi-GPU and multi-node communication libraries, both scale-up and scale-out
- Benchmark, profile and optimize code to maximize throughput on single-GPU, multi-GPU and clustered systems
- Deploy the libraries on large clusters and debug complex system-level issues that can span different layers of the software stack, particularly the NIC and GPU kernel drivers.
Other
- You are accustomed to working in a dynamic, geographically distributed agile team, where partnership and collaboration are paramount.
- You possess excellent written and verbal communication skills, strong attention to detail, and the ability to express your work in a clear, cohesive fashion.
- You are results-oriented and accustomed to tight deadlines and changing priorities.
- Most importantly, you are constantly thinking of ways to improve performance of software and hardware.
- You have in-depth knowledge of best practices in software development, including testing, profiling, debugging, documentation, version control, issue tracking, and planning.