Design and implement scalable and automated operational processes for incident management, change execution, security operations, capacity planning, monitoring, and disaster recovery to drive reliability engineering and operational excellence
Requirements
- Expertise in Exadata and in Oracle databases running on the Exadata platform
- Proven experience in designing and managing large-scale cloud infrastructure operations in environments like OCI, AWS, Azure, GCP, or similar platforms
- Strong knowledge of automation and orchestration tools (e.g., Terraform, Ansible, Kubernetes)
- Expertise in monitoring and observability tools (e.g., Prometheus, Grafana, Datadog, New Relic)
- Deep understanding of operational frameworks such as ITIL, SRE principles, and DevOps methodologies
- Experience in multi-cloud or hybrid cloud environments
- Certifications such as AWS Certified Solutions Architect, Google Cloud Professional Cloud Architect, or similar are desirable
Responsibilities
- Design and implement scalable and automated operational processes for incident management, change execution, security operations, capacity planning, monitoring, and disaster recovery
- Collaborate with Operations and Development teams to ensure that operational workflows align with reliability and scalability goals
- Define and implement KPIs and SLAs for operational performance, and develop continuous improvement programs to meet and exceed them
- Lead efforts to automate repetitive and manual operational tasks using tools, scripts, and platforms to improve efficiency and reduce risk
- Develop and refine incident management and response strategies, ensuring rapid resolution and root cause analysis for critical issues
- Architect and implement systems to monitor, predict, and optimize infrastructure utilization at global scale
- Partner with engineering and product teams to ensure operational readiness for new services and features
Other
- US citizenship and an active US Government TS/SCI security clearance with polygraph required
- 10+ years of experience in cloud infrastructure operations, SRE, or similar roles
- Bachelor’s degree in Computer Science or Engineering, or equivalent practical experience
- Strong collaboration and communication skills to work effectively with cross-functional teams and stakeholders
- Ability to lead through influence and drive consensus on technical and process improvements