Staff Software Engineer, Site Reliability

BABYLIST

$199,200 - $239,040
Sep 25, 2025
Remote, US

Babylist is hiring a Staff Software Engineer, Site Reliability, to ensure the stability, scalability, and reliability of its systems and services.

Requirements

  • Proficiency with Terraform is a must, as you will be a member of the team responsible for managing and building our AWS infrastructure using Infrastructure as Code (IaC) practices
  • You possess strong experience working with AWS cloud-based infrastructure and services, ensuring their reliability, performance, and security
  • Proficiency with Docker and Kubernetes is essential, as you will contribute to the design, deployment, and management of containerized applications in our environment
  • You have a solid understanding of cloud-native systems design, including CDNs, load balancers, cloud networking, DNS, caching, and distributed systems
  • Troubleshooting and debugging are second nature to you, allowing you to quickly identify and resolve issues across various environments
  • Experience designing and supporting CI systems such as CircleCI, Jenkins, or GitHub Actions
  • You are familiar with monitoring and alerting best practices, utilizing tools like Datadog, Cronitor, Sentry, and PagerDuty to ensure proactive identification and resolution of issues

Responsibilities

  • Manage and build our AWS infrastructure using Infrastructure as Code (IaC) tools like Terraform
  • Ensure that our EKS clusters and databases are running up-to-date versions, optimizing performance and reliability
  • Improve the speed and reliability of our Continuous Integration (CI) systems to support the entire Engineering Team, enabling faster and more efficient development and deployment processes
  • Provide support to developers in troubleshooting issues across local development, staging, and production environments
  • Establish, communicate, and support best practices for monitoring and alerting, including setting up effective monitoring systems and defining actionable alerts for proactive incident management

Other

  • 8+ years of experience as a Site Reliability Engineer or similar role, demonstrating a strong background in maintaining highly available and scalable systems
  • Experience supporting high-traffic consumer-facing websites, understanding the unique challenges and considerations in maintaining such systems
  • Proven experience in on-call management best practices, including effective incident response, escalation procedures, and post-incident reviews to drive continuous improvement and ensure system reliability
  • You have excellent verbal and written communication skills, and the ability to collaborate effectively with cross-functional teams
  • You pay close attention to detail