Oracle Cloud Infrastructure (OCI) needs to advance innovative technologies to effectively analyze, understand, respond to, and prevent cloud-scale incidents. The goal is to drive operational excellence and measurably reduce the duration, recurrence, and impact of incidents by developing resilient cloud services and automation solutions.
Requirements
- 6+ years of professional software development experience, with a focus on resilient cloud services, operational automation tools, and large-scale distributed systems
- Advanced proficiency in Java and experience with modern software development frameworks and tools
- Deep experience with cloud computing platforms (Oracle Cloud Infrastructure, AWS, Azure, or GCP)
- Extensive understanding of DevOps practices, including CI/CD, infrastructure as code, automated testing, and monitoring
- Demonstrated expertise in operational excellence, incident prevention, incident analysis, post-incident problem management, and reliability improvement programs
- Experience with observability tools (e.g., Grafana, Prometheus, ELK), scripting languages (e.g., Python, Bash), and AI/ML-based operational automation is advantageous
Responsibilities
- Architect, design, and develop scalable solutions and automation that address incident prevention, detection, analysis, resolution, and problem management at cloud scale
- Drive technical excellence by setting best practices in code quality, proactive monitoring, testing, and documentation
- Provide technical leadership for key initiatives, breaking down complex problems and guiding the team toward long-term, strategic solutions
- Collaborate cross-functionally with engineers, product owners, and operations leaders to identify requirements, align priorities, and deliver impactful tools
- Lead and participate in post-incident reviews, using data-driven approaches to address root causes and implement preventive measures
- Continuously evaluate and introduce new technologies, frameworks, and methodologies to evolve the team’s capabilities
- Support production systems, participate in the on-call rotation as needed, and champion proactive approaches to reliability, prevention, and automation
Other
- Mentor and coach junior engineers, fostering a culture of knowledge sharing, learning, and continuous improvement
- Effectively communicate complex technical concepts to both technical and non-technical stakeholders through documentation, presentations, and executive summaries
- Track record of technical leadership within engineering teams, including mentoring, coaching, and driving complex projects to completion
- Experience working in an Agile/Scrum environment
- Strong analytical and problem-solving skills, and a passion for tackling technically challenging issues