Oracle is building a robust ecosystem to support the end-to-end lifecycle of AI and machine learning workloads on Oracle Cloud Infrastructure (OCI), empowering teams to build and deploy AI at scale.
Requirements
- 5+ years of experience shipping scalable, cloud-native distributed systems
- Experience building control-plane/data-plane solutions for cloud-native companies
- Proficiency in Go and Java
- Experience with Kubernetes controllers, operators, and CRDs
- Experience building highly available services, with knowledge of common service-oriented design patterns and service-to-service communication protocols
- Experience with production operations and best practices for shipping quality code to production and troubleshooting issues when they arise
- Deep understanding of Unix-like operating systems
Responsibilities
- Design, implement, and operate scalable services for GPU-based model training, tuning, and inference.
- Build tools and APIs that enable internal and external users to easily launch, monitor, and manage ML workloads.
- Collaborate with product, infrastructure, and ML engineering teams to define and deliver key platform features.
- Optimize performance, reliability, and efficiency of AI infrastructure using best-in-class engineering practices.
- Contribute to platform automation, observability, CI/CD pipelines, and operational excellence.
- Troubleshoot complex issues in distributed systems and participate in on-call rotations as needed.
- Build cloud services on top of the modern Infrastructure-as-a-Service (IaaS) building blocks at OCI.
Other
- Collaborate with top engineers and researchers in a fast-paced, innovation-driven environment.
- Grow your career in a supportive, mission-driven team building the future of enterprise AI.
- Mentor junior engineers and participate in design and code reviews.
- Share on-call responsibilities for the service with the team.
- Communicate technical ideas effectively, both verbally and in writing (technical proposals, design specs, architecture diagrams, and presentations).