Oracle is looking for a Senior Software Engineer to help shape the future of its AI infrastructure and services by working on critical components of OCI’s AI platform, including high-scale GPU cluster management, self-service ML infrastructure, and model training and serving systems.
Requirements
- 4+ years of distributed service engineering experience in a software development environment.
- Development experience in a modern programming language, such as Java or Python.
- Hands-on experience designing, developing, and operating public cloud service control or data planes.
- Experience with distributed systems, container orchestration (e.g., Kubernetes), and microservices architecture.
- Understanding of machine learning pipelines, model training/tuning, and GPU workloads.
- Familiarity with AI frameworks (e.g., PyTorch, TensorFlow) and MLOps tools (e.g., MLflow, Ray, Kubeflow) is a plus.
- Familiarity with NVIDIA tools such as CUDA, NCCL, and Run:ai is a plus.
Responsibilities
- Design, implement, and operate scalable services for GPU-based model training, tuning, and inference.
- Build tools and APIs that enable internal and external users to easily launch, monitor, and manage ML workloads.
- Collaborate with product, infrastructure, and ML engineering teams to define and deliver key platform features.
- Optimize performance, reliability, and efficiency of AI infrastructure using best-in-class engineering practices.
- Contribute to platform automation, observability, CI/CD pipelines, and operational excellence.
- Troubleshoot complex issues in distributed systems and participate in on-call rotations as needed.
- Mentor junior engineers and participate in design and code reviews.
Other
- Demonstrable technical leadership and mentorship skills.
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.