o9 aims to transform decision-making through an AI-first approach by integrating siloed planning capabilities and capturing value leakage, helping businesses plan smarter and faster, improve operational efficiency, and reduce waste.
Requirements
- Strong knowledge of cloud platforms (AWS, Azure, GCP) and container orchestration (Kubernetes).
- Expertise in observability tools (Prometheus, Grafana, Datadog, etc.) and incident management platforms.
- Experience with infrastructure-as-code and configuration management tools (Terraform, Ansible, Helm, etc.).
- Solid understanding of networking, security, Linux internals, and distributed systems.
- Relevant cloud certifications (AWS, Azure, or GCP) strongly preferred.
- Certified Kubernetes Administrator (CKA) certification is a plus.
- Experience operating complex, cloud-native production systems at scale.
Responsibilities
- Hire, mentor, and manage a globally distributed team of Site Reliability Engineers.
- Own system uptime and SLA compliance across o9’s cloud-native production environment.
- Drive root cause analysis and implement post-incident learning processes to improve system resilience.
- Oversee the design and implementation of robust monitoring, alerting, and logging solutions.
- Lead initiatives to improve infrastructure automation, deployment pipelines, and CI/CD practices.
- Champion Infrastructure as Code (IaC) and GitOps best practices.
- Manage capacity planning, scalability efforts, and performance tuning across services.
Other
- Bachelor’s degree in Computer Science, Engineering, or a related field required; Master’s degree preferred.
- 8+ years of experience in DevOps, SRE, or infrastructure roles, with 2+ years leading or managing technical teams.
- Proven ability to lead technical teams through high-stakes, high-impact situations.
- Strong communication skills with the ability to translate complex topics into clear stakeholder updates.
- Strategic mindset with a bias for action and problem-solving.