The Oracle Cloud Infrastructure (OCI) Cluster Networking team is building an ultra-high-performance network to support AI/ML/HPC workloads and is designing systems that scale from tens to hundreds of thousands of GPUs without sacrificing performance.
Requirements
- Strong knowledge of, and practical experience with, NCCL
- Experience with collective communication libraries such as NCCL, RCCL, and MPI, and with GPU frameworks such as CUDA and ROCm (a minimal NCCL sketch follows this list)
- Experience with ML training frameworks such as PyTorch and TensorFlow
- Proficient in at least two of C/C++, Python, Java, Scala, and Go
- Proficient with data structures, algorithms, and operating systems
- Experience with RDMA programming, including but not limited to GPUDirect RDMA
- Experience with distributed workload managers such as Slurm or Kubernetes (K8s)
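The sketch below is one illustration of the collective-communication work named above: a single-process, multi-GPU all-reduce using the public NCCL API (ncclCommInitAll, ncclAllReduce). The device count, buffer size, and error-handling macros are illustrative assumptions, not part of this posting.

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define NDEV  4             /* assumption: 4 visible GPUs on one host */
#define COUNT (1 << 20)     /* 1M floats per device */

#define CHECK_CUDA(cmd) do { cudaError_t e = (cmd); if (e != cudaSuccess) { \
  fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); return 1; } } while (0)
#define CHECK_NCCL(cmd) do { ncclResult_t r = (cmd); if (r != ncclSuccess) { \
  fprintf(stderr, "NCCL error: %s\n", ncclGetErrorString(r)); return 1; } } while (0)

int main(void) {
  int devs[NDEV];
  ncclComm_t comms[NDEV];
  float *sendbuff[NDEV], *recvbuff[NDEV];
  cudaStream_t streams[NDEV];

  /* Allocate a send/recv buffer and a stream on each device. */
  for (int i = 0; i < NDEV; ++i) {
    devs[i] = i;
    CHECK_CUDA(cudaSetDevice(i));
    CHECK_CUDA(cudaMalloc((void **)&sendbuff[i], COUNT * sizeof(float)));
    CHECK_CUDA(cudaMalloc((void **)&recvbuff[i], COUNT * sizeof(float)));
    CHECK_CUDA(cudaMemset(sendbuff[i], 0, COUNT * sizeof(float)));
    CHECK_CUDA(cudaStreamCreate(&streams[i]));
  }

  /* One communicator per device, all joined into a single clique. */
  CHECK_NCCL(ncclCommInitAll(comms, NDEV, devs));

  /* Launch the sum all-reduce on every device inside one group call. */
  CHECK_NCCL(ncclGroupStart());
  for (int i = 0; i < NDEV; ++i)
    CHECK_NCCL(ncclAllReduce(sendbuff[i], recvbuff[i], COUNT, ncclFloat, ncclSum,
                             comms[i], streams[i]));
  CHECK_NCCL(ncclGroupEnd());

  /* Wait for completion, then release resources. */
  for (int i = 0; i < NDEV; ++i) {
    CHECK_CUDA(cudaSetDevice(i));
    CHECK_CUDA(cudaStreamSynchronize(streams[i]));
    CHECK_CUDA(cudaFree(sendbuff[i]));
    CHECK_CUDA(cudaFree(recvbuff[i]));
    CHECK_CUDA(cudaStreamDestroy(streams[i]));
  }
  for (int i = 0; i < NDEV; ++i)
    CHECK_NCCL(ncclCommDestroy(comms[i]));

  printf("all-reduce complete on %d devices\n", NDEV);
  return 0;
}
```

A multi-node variant of the same pattern would use ncclGetUniqueId and ncclCommInitRank, with the ID distributed over MPI or a launcher such as Slurm.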
Responsibilities
- Design systems that scale from tens to hundreds of thousands of GPUs without sacrificing performance
- Work across the stack to develop and tune the software and hardware used for distributed workloads, with libraries such as NCCL on high-speed networks
- Apply collective communication libraries to tune system performance at unprecedented scale (a minimal benchmarking sketch follows this list)
- Collaborate with other engineers to design and implement ultra-high-performance networks
- Use collective communication libraries such as NCCL, RCCL, and MPI, and GPU frameworks such as CUDA and ROCm
- Work with ML training frameworks such as PyTorch and TensorFlow
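As one assumed (not prescribed by this posting) illustration of performance tuning at the collective-communication layer, the sketch below times repeated MPI_Allreduce calls and reports a rough per-rank algorithm bandwidth; the message size and iteration count are arbitrary choices.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const size_t count = 1 << 22;   /* 4M floats (16 MiB) per rank */
  float *sendbuf = (float *)malloc(count * sizeof(float));
  float *recvbuf = (float *)malloc(count * sizeof(float));
  for (size_t i = 0; i < count; ++i) sendbuf[i] = 1.0f;

  /* Time several iterations of the collective to average out jitter. */
  const int iters = 20;
  MPI_Barrier(MPI_COMM_WORLD);
  double start = MPI_Wtime();
  for (int it = 0; it < iters; ++it)
    MPI_Allreduce(sendbuf, recvbuf, (int)count, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
  MPI_Barrier(MPI_COMM_WORLD);
  double elapsed = (MPI_Wtime() - start) / iters;

  if (rank == 0) {
    /* Bytes per rank divided by time: a rough proxy for network utilization. */
    double gb = (double)count * sizeof(float) / 1e9;
    printf("ranks=%d  avg allreduce time=%.3f ms  algbw=%.2f GB/s\n",
           size, elapsed * 1e3, gb / elapsed);
  }

  free(sendbuf);
  free(recvbuf);
  MPI_Finalize();
  return 0;
}
```

Such a benchmark would typically be launched under a workload manager, e.g. `srun -N 2 -n 16 ./allreduce_bench` or `mpirun -np 16 ./allreduce_bench`.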
Other
- Bachelor's degree in Computer Science, Engineering, or a related engineering field
- Excellent organizational, verbal, and written communication skills
- 5+ years of experience with software (systems/application) development
- Certain US customer or client-facing roles may be required to comply with applicable requirements, such as immunization and occupational health mandates
- Adaptable and self-motivated, with the ability to learn quickly and work across the stack