Improve the reliability, scalability, and performance of Benchmark Education Company's cloud-based systems.
Requirements
- 5+ years of experience in Site Reliability Engineering, DevOps, or Software Engineering with a focus on production operations.
- Strong knowledge of AWS cloud services and cloud-native architectures.
- Proficiency in scripting or programming languages (e.g., Python, Bash).
- Experience with observability tools (e.g., CloudWatch, Datadog, Prometheus, Grafana).
- Familiarity with infrastructure-as-code tools (e.g., Terraform, CloudFormation) and CI/CD pipelines.
- AWS certifications (e.g., AWS Certified Solutions Architect – Associate or AWS Certified DevOps Engineer – Professional).
- Experience with containerization (Docker, ECS, Kubernetes/EKS).
Responsibilities
- Contribute to the design, development, and delivery of features that enhance system reliability and scalability.
- Define, measure, and improve SLIs, SLOs, and error budgets in collaboration with engineering teams.
- Implement and improve observability tooling and practices to monitor the health and performance of production systems.
- Participate in incident management, including on-call rotations, root cause analysis, and postmortem reviews.
- Lead smaller initiatives or components of larger projects, ensuring technical quality and operational readiness.
- Collaborate with software engineering, security, and product teams to ensure resilient and secure system design.
- Contribute to automation efforts that reduce toil and improve the efficiency of operational processes.
Other
- Participate in building a culture of reliability through knowledge sharing, documentation, and process improvements.
- Mentor junior engineers, sharing expertise in SRE principles and AWS best practices.
- Strong problem-solving skills and ability to work cross-functionally.
- Some experience mentoring or coaching junior engineers.
- Commitment to reliability, operational excellence, and continuous improvement.