Senior Software Engineer, Site Reliability

Babylist

$186,818 - $232,000
Dec 7, 2025
Canada, KY, US

Babylist is hiring a Senior Software Engineer, Site Reliability, to support shared infrastructure and developer tools and to ensure the stability, scalability, and reliability of its systems and services.

Requirements

  • Proficiency with Terraform is a must, as you will be a member of the team responsible for managing and building our AWS infrastructure using Infrastructure as Code (IaC) practices
  • You possess strong experience working with AWS cloud-based infrastructure and services, ensuring their reliability, performance, and security
  • Proficiency with Docker and Kubernetes is essential, as you will contribute to the design, deployment, and management of containerized applications in our environment
  • You have a solid understanding of cloud-native systems design, including CDNs, load balancers, cloud networking, DNS, caching, and distributed systems
  • Troubleshooting and debugging are second nature to you, allowing you to quickly identify and resolve issues across various environments
  • Experience designing and supporting CI systems such as CircleCI, Jenkins, or GitHub Actions
  • You are familiar with monitoring and alerting best practices, utilizing tools like Datadog, Cronitor, Sentry, and PagerDuty to ensure proactive identification and resolution of issues

Responsibilities

  • Manage and build our AWS infrastructure using Infrastructure as Code (IaC) tools like Terraform.
  • Improve the speed and reliability of our Continuous Integration (CI) systems to support the entire Engineering Team, enabling faster and more efficient development and deployment processes.
  • Provide support to developers in troubleshooting issues across local development, staging, and production environments.
  • Establish, communicate, and support best practices for monitoring and alerting, including setting up effective monitoring systems and defining actionable alerts for proactive incident management.
  • Ensure that our EKS clusters and databases run up-to-date versions to optimize performance and reliability.
  • Contribute to the design, deployment, and management of containerized applications in our environment.

Other

  • 8+ years of experience as a Site Reliability Engineer or in a similar role, demonstrating a strong background in maintaining highly available and scalable systems
  • Experience supporting high-traffic consumer-facing websites, understanding the unique challenges and considerations in maintaining such systems
  • Proven experience in on-call management best practices, including effective incident response, escalation procedures, and post-incident reviews to drive continuous improvement and ensure system reliability
  • You have excellent verbal and written communication skills, and the ability to collaborate effectively with cross-functional teams
  • You're comfortable with, and enthusiastic about, working in an AI-forward environment where AI tools are part of daily operations