Oracle Cloud Infrastructure (OCI) AI Infrastructure is building a cutting-edge, ultra-high-performance GPU platform to support AI/ML/HPC workloads, enabling customers to scale from tens to thousands of GPUs without compromising performance.
Requirements
- Proficient in at least one programming language (Java, Python, C, C++, Go, or shell scripting)
- Deep understanding of operating systems, computer networks, and high-performance applications
- 6+ years’ experience delivering and operating large-scale production systems (1000+ server instances)
- Strong background in Linux systems
- Familiarity with system-level architecture, data synchronization, fault tolerance, and state management
- Experience with InfiniBand or RoCE networking
- Good understanding of databases and SQL (MySQL) and caching technologies (Redis, Memcached, etc.)
Responsibilities
- Designing and developing fundamental architectural changes for GPU delivery
- Building health monitoring, triage automation, and diagnostic services for distributed AI/ML/HPC workloads
- Operating large-scale production systems with 1000+ server instances
- Delivering products across the full software development lifecycle
- Working with technologies like RoCE and InfiniBand
- Designing, developing, and operating public cloud service data planes
- Working hands-on with GPU hardware architecture and system management
Other
- BS or MS degree in Computer Science or a relevant technical field involving coding, or equivalent practical experience
- Systematic problem-solving approach, strong communication skills, a sense of ownership, and drive
- Certain US customer or client-facing roles may be required to comply with applicable requirements, such as immunization and occupational health mandates
- Role will generally accept applications for at least three calendar days from the posting date or as long as the job remains posted
- Oracle is an Equal Employment Opportunity Employer