OCI (Oracle Cloud Infrastructure) AI Infrastructure is building a cutting-edge, ultra-high-performance GPU platform to support AI/ML/HPC workloads, allowing customers to scale from tens to thousands of GPUs without compromising performance.
Requirements
- Deep understanding of operating systems, computer networks, and high-performance applications
- Proficiency in at least one programming language (Java, Python, C, C++, Go, or shell scripting)
- Strong background in Linux systems
- Familiarity with system-level architecture, data synchronization, fault tolerance, and state management
- General enterprise storage, networking, or computing experience
- Experience with RoCE and InfiniBand technologies
- Understanding of distributed systems and algorithms
Responsibilities
- Designing and developing fundamental architectural changes for GPU delivery, health monitoring, triage automation, and diagnostic services
- Designing, implementing, and delivering software and firmware for managing GPU-based AI servers
- Working closely with product teams to debug and resolve customer issues
- Building groundbreaking solutions for customers from the ground up
- Delivering and operating large-scale production systems (1000+ server instances)
- Diving deep into any part of the stack, including software debugging and low-level systems troubleshooting
- Collaborating effectively with the teams the platform depends on, including Network and Data Center operations
Other
- BS or MS degree in Computer Science or a relevant technical field involving coding, or equivalent practical experience
- Adaptable engineers: self-motivated individuals who learn quickly
- Collaborative spirit: comfortable working in a collaborative, agile environment and eager to learn
- Ability to collaborate effectively with various dependencies
- 4+ years of experience delivering and operating large-scale production systems (1000+ server instances)