The Oracle Cloud Infrastructure (OCI) Cluster Networking team is building an ultra-high-performance network to support AI/ML/HPC workloads.
Requirements
- Strong knowledge and practical experience with NCCL
- Experience with collective communications libraries such as NCCL, RCCL, and MPI, and with GPU frameworks such as CUDA and ROCm
- Experience with ML training frameworks such as PyTorch and TensorFlow
- Proficient at programming in any two of C/C++, Python, Java, Scala, and Go
- Proficient with data structures, algorithms, and operating systems
- Experience with RDMA programming, including but not limited to GPUDirect RDMA
- Experience with distributed workload managers such as Slurm or Kubernetes (K8s)
Responsibilities
- Design systems that scale from tens to hundreds of thousands of GPUs without sacrificing performance
- Develop and tune the software and hardware stack for distributed workloads using libraries such as NCCL on high-speed networks
- Apply collective communication libraries to tune system performance at a previously unheard-of scale
- Work across the stack
- Write solid code
- Tune system performance for distributed workloads
- Collaborate in an agile environment
Other
- Bachelor's degree in Computer Science and Engineering or a related engineering field
- Excellent organizational, verbal, and written communication skills
- 7+ years of experience with software (systems/application) development
- 2+ years of experience with collective communications libraries
- Master's or PhD degree in Computer Science or a related engineering field