Lead observability, resiliency, executive reporting, and Level‑3+ support escalations for the Sales & Acquisition platform to ensure operational excellence, reliability, performance, and compliance for a cloud‑native, omnichannel ecosystem.
Requirements
- 8+ years in application development or platform engineering, with 3+ years in SRE/observability or production operations leadership.
- Proven experience implementing observability frameworks and incident/problem/change management practices.
- Strong knowledge of resiliency engineering, including fault domains and failover/chaos testing.
- Ability to translate technical telemetry into clear, business‑aligned insights.
- Relevant certifications (AWS, Azure, SAFe) are a plus.
Responsibilities
- Own observability standards: implement metrics/logs/traces, alerting, dashboards, and SLOs/SLIs across Sales & Acquisition services.
- Deliver reliability reporting: produce actionable scorecards on uptime, latency, error budgets, and MTTR for leadership.
- Lead resiliency efforts: define fault domains, run failover/chaos tests, and close gaps through prioritized backlogs.
- Manage L3+ escalations: act as the escalation point for Sev‑1/Sev‑2 incidents, coordinate the response, and drive root cause analysis.
- Improve change readiness: enforce guardrails and automated health checks in CI/CD pipelines.
- Govern capacity and cost: monitor usage trends and optimize for performance and spend.
- Lead the team: mentor a small team of SRE/observability engineers and embed best practices across squads.
Other
- Excellent communication and stakeholder management skills.
- Bachelor’s degree in Computer Science, Engineering, Information Systems, or a related field required.
- Master’s degree preferred.