Babylist is looking to ensure the stability, scalability, and reliability of its systems and services by hiring a Staff Software Engineer, Site Reliability.
Requirements
- Proficiency with Terraform is a must, as you will be a member of the team responsible for managing and building our AWS infrastructure using Infrastructure as Code (IaC) practices
- You possess strong experience working with AWS cloud-based infrastructure and services, ensuring their reliability, performance, and security
- Proficiency with Docker and Kubernetes is essential, as you will contribute to the design, deployment, and management of containerized applications in our environment
- You have a solid understanding of cloud-native systems design, including CDNs, load balancers, cloud networking, DNS, caching, and distributed systems
- Troubleshooting and debugging are second nature to you, allowing you to quickly identify and resolve issues across various environments
- Experience designing and supporting CI systems such as CircleCI, Jenkins, or GitHub Actions
- You are familiar with monitoring and alerting best practices, utilizing tools like Datadog, Cronitor, Sentry, and PagerDuty to ensure proactive identification and resolution of issues
Responsibilities
- Manage and build our AWS infrastructure using Infrastructure as Code (IaC) tools like Terraform
- Keep our EKS clusters and databases on up-to-date versions to optimize performance and reliability
- Improve the speed and reliability of our Continuous Integration (CI) systems to support the entire Engineering Team, enabling faster and more efficient development and deployment processes
- Provide support to developers in troubleshooting issues across local development, staging, and production environments
- Establish, communicate, and support best practices for monitoring and alerting, including setting up effective monitoring systems and defining actionable alerts for proactive incident management
Other
- 8+ years of experience as a Site Reliability Engineer or similar role, demonstrating a strong background in maintaining highly available and scalable systems
- Experience supporting high-traffic consumer-facing websites, understanding the unique challenges and considerations in maintaining such systems
- Proven experience in on-call management best practices, including effective incident response, escalation procedures, and post-incident reviews to drive continuous improvement and ensure system reliability
- You have excellent verbal and written communication skills and can collaborate effectively with cross-functional teams
- You pay close attention to detail