Oracle is looking to build a robust ecosystem to support the end-to-end lifecycle of AI and machine learning workloads on its OCI cloud platform, including high-scale GPU cluster management, self-service ML infrastructure, and model serving systems.
Requirements
- 8+ years of experience shipping scalable, cloud-native distributed systems
- Experience building multi-tenant Kubernetes clusters with security isolation
- Experience building Kubernetes controllers, operators, and CRDs to automate lifecycle management of AI/ML workloads
- Experience implementing advanced inference optimizations: distributed and disaggregated inference serving, multi-node inference, and KV-cache reuse
- Experience building intelligent request routing and adaptive scheduling to maximize GPU utilization
- Experience with inference solutions such as NVIDIA Dynamo, vLLM, and Ray Serve
- Experience with production operations and best practices for shipping quality code to production and troubleshooting issues as they arise
Responsibilities
- Build cloud services on top of the modern Infrastructure as a Service (IaaS) building blocks at OCI
- Design and build distributed, scalable, fault tolerant software systems
- Participate in the entire software lifecycle – development, testing, CI, and production operations
- Design and lead software projects without significant guidance; guide, mentor, and coach junior engineers
- Balance product feature development with production operational concerns such as writing runbooks, ops automation, structured logging, and instrumentation for metrics and events
- Leverage internal tooling at OCI to develop, build, deploy and troubleshoot software
- Participate in on-call for the service with the team
Other
- Able to communicate technical ideas effectively, both verbally and in writing (technical proposals, design specs, architecture diagrams, and presentations)
- BS in Computer Science or equivalent experience; MS in Computer Science preferred
- Deep understanding of Unix-like operating systems
- Production experience with Cloud and ML technologies