Staff Site Reliability Engineer - Site Reliability Engineering

$137,500 - $236,500

Sep 30, 2025

San Jose, CA, USA

Ensure high availability, performance, and scalability of the critical systems powering PayPal Shopping/Honey’s business.

8+ years in Cloud Infrastructure, Site Reliability Engineering (SRE), DevOps Engineering, or related fields
4+ years of hands-on experience deploying, managing, and optimizing containerized applications using GKE and Harness in public and private cloud environments (AWS, GCP, Azure, etc.), preferably Google Cloud Platform (GCP).
4+ years of hands-on experience with infrastructure as code (Terraform, CloudFormation) and CI/CD pipelines (CircleCI, Harness, Jenkins, Argo CD), plus programming experience in Node.js, Python, or Go.
Strong understanding of Google Cloud Logging, Datadog, or other monitoring and observability tools.
Ability to effectively diagnose and resolve performance bottlenecks within GCP at the infrastructure and application layers.
Own and enhance the reliability of services deployed across various cloud regions. You will proactively monitor, automate, and scale services to ensure seamless uptime and performance with an eye on cost.
Lead the containerization, deployment, and scaling of microservices and data pipelines on Google Kubernetes Engine (GKE), with a strong emphasis on reliability and fault tolerance.

Manage and deliver large-scale reliability improvement projects, ensuring systems are performant, available, and resilient.
Drive the identification of performance bottlenecks and lead initiatives to optimize and scale critical systems and services.
Architect and implement scalable infrastructure solutions to support growing user demands while maintaining system reliability.
Lead the design and enhancement of monitoring frameworks, ensuring systems are highly observable, and support the response to production incidents.
Take ownership of improving system resilience by designing fault-tolerant architectures and implementing disaster recovery strategies.
Lead capacity planning initiatives to ensure system resources are proactively managed, preventing downtime or performance degradation under high load.
Work closely with development, operations, and other technical teams to ensure seamless system integration and align on best practices for reliability.

Minimum of 8 years of relevant work experience and a Bachelor's degree or equivalent experience.
Strong leadership abilities; must have customer focus and a commitment to quality.
Must have strong interpersonal skills and solid written and verbal communication skills.
Ability to remain composed, methodical, and think fast in a high-pressure environment.
Experience managing, collaborating with, and influencing global teams.