WGU is looking to improve the reliability, performance, and operational efficiency of its critical systems and services to ensure students and faculty are delighted with the fully online educational experience.
Requirements
- Strong understanding of distributed systems, cloud-native architectures, and infrastructure design.
- Deep familiarity with cloud service providers (AWS, GCP, Azure) and their reliability and security best practices.
- Knowledge of software development lifecycles, DevOps principles, and SRE practices such as SLOs, SLIs, and error budgets.
- Technical proficiency in infrastructure as code, automation frameworks, and modern programming/scripting languages (Python, Go, Bash, etc.).
- Expertise in monitoring, logging, and observability platforms (Prometheus, Grafana, Datadog, Splunk, etc.).
- Skilled in incident management, root cause analysis, and postmortem processes.
- Hands-on experience with Kubernetes, container orchestration, and microservices architectures.
Responsibilities
- Defines reliability roadmaps and communicates priorities to engineering and executive stakeholders.
- Develops, drives, and supports Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) across systems.
- Directs incident management processes, including response coordination, root cause analysis, and follow-up actions.
- Implements practices that reduce downtime and ensure systems meet availability, scalability, and performance expectations.
- Drives adoption of infrastructure as code, CI/CD pipelines, and automated testing to improve operational efficiency.
- Oversees monitoring, alerting, and observability systems that provide insight into service health.
- Partners with software engineering, security, and product teams to integrate reliability into all development lifecycle phases.
Other
- Leads and mentors SRE teams, creating an environment that encourages ownership, collaboration, and continuous improvement.
- Establishes the SRE vision, goals, and operational strategies in alignment with organizational objectives.
- Strong leadership and people management skills, with experience developing and scaling technical teams.
- Effective communication skills, including the ability to explain technical concepts to both engineers and executives.
- Ability to balance short-term operational needs with long-term reliability and scalability goals.