The Oracle Cloud Infrastructure (OCI) Cluster Networking team is building an ultra-high-performance network to support AI/ML/HPC workloads. The team needs to design systems that scale from tens to hundreds of thousands of GPUs without sacrificing performance.
Requirements
- Strong knowledge and practical experience with NCCL is essential for this role.
- 1+ years of experience with collective communication libraries such as NCCL, RCCL, and MPI, and with GPU frameworks such as CUDA and ROCm
- 1+ years of experience with ML training frameworks such as PyTorch or TensorFlow
- Proficient in programming in any two of C/C++, Python, Java, Scala, and Go
- Proficient with data structures, algorithms, and operating systems
- Experience with RDMA programming, including but not limited to GPUDirect RDMA
- Experience with distributed workload managers like Slurm or K8s
Responsibilities
- Design systems that scale from tens to hundreds of thousands of GPUs without sacrificing performance
- Develop and tune the software and hardware stack for distributed workloads using libraries such as NCCL on high-speed networks
- Apply collective communication libraries to tune system performance at a previously unheard-of scale
- Write solid code
- Work across the stack
Other
- 5+ years of experience with software (systems/application) development
- Excellent organizational, verbal, and written communication skills
- Bachelor's degree in Computer Science and Engineering or a related engineering field
- Master's or PhD degree in Computer Science or a related engineering field
- Adaptable, self-motivated, and quick to learn