The company is seeking to mature its Site Reliability Engineering (SRE) function for the Benefits product line to ensure the scalability, reliability, and operational excellence of its platforms.
Requirements
Deep expertise in SRE principles, including SLIs, SLOs, error budgets, and incident management.
Strong background in cloud platforms (Azure, AWS, GCP) and Kubernetes-based architectures.
Hands-on experience with observability tools (Datadog, Prometheus, Grafana, OpenTelemetry, etc.).
Strong understanding of CI/CD, Infrastructure as Code (Terraform, Pulumi), and security best practices.
Knowledge of modern software architectures, microservices, and distributed systems.
Deep understanding of failover and recovery architectures (Active-Active, Active-Passive).
Experience working with compliance frameworks such as HIPAA, SOC 2, or HITRUST.
Responsibilities
Define and implement Service Level Objectives (SLOs), error budgets, and operational KPIs to continuously improve system reliability, performance, and recoverability.
Establish automated monitoring, observability, and incident response processes to proactively detect and resolve issues.
Drive incident management and problem management processes, ensuring effective root cause analysis and remediation.
Foster a culture of blameless postmortems and continuous learning, turning incidents into opportunities for improvement.
Partner with Product, Engineering, Architecture, Security, and Infrastructure teams to embed reliability into the software development lifecycle.
Influence engineering teams to adopt best practices for high availability, scalability, and fault tolerance in application design.
Ensure SRE plays a critical role in migrating legacy systems to modern cloud-based architectures as part of the Benefits Modernization program.
Other
15+ years of experience in software engineering, reliability engineering, DevOps, and/or cloud infrastructure roles.
7+ years of leadership experience managing large-scale engineering or SRE teams.
Proven track record leading large-scale transformations, especially in regulated industries such as healthcare, benefits, or financial services.
Exceptional communication, stakeholder management, and executive-level presentation skills.
Experience working in high-growth, Agile environments, driving cultural and process change.