The Oracle Cloud Infrastructure (OCI) Cluster Networking team is building an ultra-high-performance network to support AI/ML/HPC workloads. The core problem is designing systems that scale from tens to hundreds of thousands of GPUs without sacrificing performance.
Requirements
- 1+ years of experience with collective communications libraries such as NCCL, RCCL, or MPI, and with GPU frameworks such as CUDA and ROCm (a minimal usage sketch follows this list)
- 1+ years of experience with ML training frameworks such as PyTorch or TensorFlow
- Proficient in at least two of C/C++, Python, Java, Scala, and Go
- Proficient with data structures, algorithms, and operating systems
- Experience with RDMA programming, including but not limited to GPUDirect RDMA
- Experience with distributed workload managers such as Slurm or Kubernetes (K8s)
- Experience with Linux performance tools
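For context only, the kind of collective-communication primitive referenced above looks roughly like the following sketch, which uses PyTorch's torch.distributed with the NCCL backend. The launch method, tensor size, and file name are illustrative assumptions, not part of this posting.

```python
# Minimal sketch: one all-reduce over the NCCL backend via torch.distributed.
# Assumes launch with `torchrun --nproc_per_node=<gpus> allreduce_sketch.py`
# on a CUDA-capable host; names and sizes here are illustrative only.
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a GPU-resident tensor; NCCL sums them in place.
    payload = torch.ones(1 << 20, device="cuda") * dist.get_rank()
    dist.all_reduce(payload, op=dist.ReduceOp.SUM)

    # After the collective, every rank holds the same summed result.
    print(f"rank {dist.get_rank()}: payload[0] = {payload[0].item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```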
Responsibilities
- Develop and tune the software and hardware stack for distributed workloads using libraries such as NCCL on high-speed networks
- Apply collective communication libraries to tune system performance at unprecedented scale (see the microbenchmark sketch after this list)
- Write solid code
- Work across the hardware and software stack
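As an illustration of the tuning work described above (not a prescribed method), one common way to gauge collective performance is to time repeated all-reduces and convert the result into bus bandwidth, similar in spirit to nccl-tests. The message size, iteration counts, and the ring-all-reduce bandwidth correction below are assumptions for the sketch.

```python
# Sketch: time all-reduce over NCCL and estimate bus bandwidth,
# in the spirit of nccl-tests. Launch with torchrun; all numbers
# (message size, warmup/iteration counts) are illustrative assumptions.
import os
import time

import torch
import torch.distributed as dist


def allreduce_busbw(num_elems: int, iters: int = 20, warmup: int = 5) -> float:
    """Return estimated bus bandwidth in GB/s for a float32 all-reduce."""
    tensor = torch.randn(num_elems, device="cuda")
    world = dist.get_world_size()

    for _ in range(warmup):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    # Algorithm bandwidth: bytes handled per second from the caller's view.
    alg_bw = tensor.numel() * tensor.element_size() / elapsed
    # Bus bandwidth correction for ring all-reduce: 2 * (n - 1) / n.
    return alg_bw * 2 * (world - 1) / world / 1e9


def main() -> None:
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    bw = allreduce_busbw(num_elems=64 * 1024 * 1024)  # 256 MB of float32
    if dist.get_rank() == 0:
        print(f"all-reduce bus bandwidth ~ {bw:.1f} GB/s")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```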
Other
- 5+ years of experience with software (systems/application) development
- Excellent organizational, verbal, and written communication skills
- Bachelor's degree in Computer Science and Engineering or a related engineering field
- Master's or PhD degree in Computer Science or a related engineering field
- Experience with SDN, NFV, and cloud networking