AMD is looking to develop multi-node GPU communication libraries to enable high-performance computing and machine learning workloads at Exascale.
Requirements
- Strong background developing applications and libraries in C, C++, and Python
- Experience working with RoCE (RDMA over Converged Ethernet), Libfabric, and InfiniBand
- Experience working with the Linux kernel, device drivers, and network drivers
- Experience designing and building GPU networks for large-scale clusters
- Experience with collective communication libraries (MPI, RCCL, SHMEM) and with optimizing collective communication to scale across distributed systems
- In-depth knowledge of best-practices in software development, including testing, profiling, debugging, documentation, version control, issue tracking, and planning
- GPU software development using HIP, CUDA, or OpenCL
Responsibilities
- Support AMD’s RCCL, an open-source, GPU-accelerated collective communication middleware, and related technologies
- Design, implement, and test networking features for multi-GPU and multi-node communication libraries.
- Benchmark, profile and optimize code to maximize throughput on single-GPU, multi-GPU and clustered systems
- Deliver high-quality code and documentation following best practices for open source software development
- Work with key technical experts across AMD and with our partners and customers to improve ROCm applications, libraries, and tools
- Deploy the libraries on large clusters and debug complex system-level issues that can span different layers of the software stack, such as GPU kernel drivers and NIC drivers
Other
- Accustomed to working in a dynamic, geographically distributed agile team, where partnership and collaboration are paramount.
- Possess excellent written and verbal communication skills, strong attention to detail, and the ability to express your work in a clear, cohesive fashion.
- Results-oriented and accustomed to tight deadlines and changing priorities.
- Constantly thinking of ways to improve performance of software and hardware.
- B.Sc. or B.Eng. degree in Computer Science, Software Engineering, Electrical Engineering, or equivalent