OCI (Oracle Cloud Infrastructure) AI Infrastructure is building a cutting-edge, ultra-high-performance GPU platform designed to support AI/ML/HPC workloads, enabling customers to scale from tens to thousands of GPUs without compromising performance.
Requirements
- Deep understanding of operating systems, computer networks, and high-performance applications
- Proficiency in at least one programming language (Java, Python, C, C++, Go, or shell scripting)
- Strong background in Linux systems
- Familiarity with system-level architecture, data synchronization, fault tolerance, and state management
- General enterprise storage, networking, or computing experience
- Experience with InfiniBand or RoCE networking
- Good understanding of databases and SQL (MySQL) and of caching technologies (Redis, Memcached, etc.)
Responsibilities
- Designing and developing fundamental architectural changes for GPU delivery, health monitoring, triage automation, and diagnostic services.
- Running distributed AI/ML/HPC workloads across thousands of GPUs, leveraging technologies like RoCE and InfiniBand.
- Delivering and operating large-scale production systems (1000+ server instances).
- Systematic problem-solving approach.
- Proven ability to deliver products and experience with the full software development lifecycle.
- Hands-on experience designing, developing, and operating public cloud service data planes.
- Experience with Server/GPU hardware architecture and system management.
Other
- BS or MS degree in Computer Science or a relevant technical field involving coding, or equivalent practical experience
- 6+ years’ experience delivering and operating large-scale production systems (1000+ server instances)
- Strong communication skills, a sense of ownership, and drive
- Certain US customer or client-facing roles may be required to comply with applicable requirements, such as immunization and occupational health mandates.
- Oracle US offers a comprehensive benefits package