OCI (Oracle Cloud Infrastructure) AI Infrastructure is building a cutting-edge, ultra-high-performance GPU platform designed to support AI/ML/HPC workloads, enabling customers to scale from tens to thousands of GPUs without compromising performance.
Requirements
- 4+ years of backend software development experience
- Proficiency in Java or a similar object-oriented language (e.g., Go).
- Experience with at least one scripting language (Python, Shell) for automating tasks, proof of concept work, or command line tools.
- Strong working experience with Git/Bitbucket.
- Hands-on experience building operational tools and dashboards
- Hands-on experience developing services on a public cloud platform (e.g., AWS, Azure, Oracle)
- Experience with and understanding of multi-AD/AZ (availability domain/availability zone) and regional data centers
Responsibilities
- Designing and developing fundamental architectural changes for GPU delivery, health monitoring, triage automation, and diagnostic services.
- Running distributed AI/ML/HPC workloads across thousands of GPUs, leveraging technologies like RoCE and InfiniBand.
- Building groundbreaking solutions for our customers from the ground up.
- Working as a rock-solid developer and distributed systems engineer with a deep understanding of distributed systems and algorithms.
- Diving deep into any part of the stack, including software debugging and low-level systems troubleshooting.
- Building operational tools and dashboards
- Building continuous integration/deployment pipelines with robust testing and deployment schedules
Other
- BS or MS degree in Computer Science or a relevant technical field involving coding, or equivalent practical experience
- Self-motivated, with the ability to learn quickly.
- Value simplicity and scalability in design and implementation.
- Comfortable working in a collaborative, agile environment and eager to learn.
- Ability to collaborate effectively with the teams this work depends on, including Network and Data Center operations.