OCI (Oracle Cloud Infrastructure) AI Infrastructure is building a cutting-edge, ultra-high-performance GPU platform designed to support AI/ML/HPC workloads, enabling customers to scale from tens to thousands of GPUs without compromising performance.
Requirements
- Deep understanding of operating systems, computer networks, and high-performance applications
- 6+ years’ experience delivering and operating large-scale production systems (1000+ server instances)
- Proficiency in at least one programming language (Java, Python, C, C++, Go, or shell scripting)
- Strong background in Linux systems
- Familiarity with system-level architecture, data synchronization, fault tolerance, and state management.
- General enterprise storage, networking, or computing experience
- Experience with Server/GPU hardware architecture and system management.
Responsibilities
- Designing and developing fundamental architectural changes for GPU delivery, health monitoring, triage automation, and diagnostic services.
- Running distributed AI/ML/HPC workloads across thousands of GPUs, leveraging technologies like RoCE and InfiniBand.
- Building groundbreaking solutions for our customers from the ground up.
- Deep diving into any part of the stack, as well as software debugging and low-level systems troubleshooting.
- Designing, developing, and operating public cloud service data planes.
- Applying a systematic problem-solving approach, strong communication skills, a sense of ownership, and drive.
- Delivering products across the full software development lifecycle.
Other
- Self-motivated, with the ability to learn quickly.
- Value simplicity and scalability in design and implementation.
- Comfortable working in a collaborative, agile environment and eager to learn.
- Ability to collaborate effectively with the teams this work depends on, including Network and Data Center operations.
- BS or MS degree in Computer Science or relevant technical field involving coding or equivalent practical experience