Senior Site Reliability Engineer - Observability

Rivian

$146,900 - $194,610

Sep 22, 2025

Palo Alto, CA, USA

Rivian and Volkswagen Group Technologies is working on the challenges of automotive's next chapter by developing technology for software-defined vehicles, with a focus on operating systems, zonal controllers, and cloud/connectivity solutions. The Senior SRE role is crucial for ensuring the health, performance, and reliability of the company's production environment through robust observability systems.

Requirements

Proficiency in designing and operating observability platforms with tools like Prometheus, Grafana, Loki, Jaeger, or Datadog.
Experience with OpenTelemetry and distributed tracing in microservices architectures.
Deep knowledge of Kubernetes (e.g., EKS), ArgoCD, and Crossplane.
Strong proficiency in Python, Go, or similar languages for building automation and custom telemetry solutions.
Familiarity with multi-cloud setups, containerization (Docker), and Linux system fundamentals.

Responsibilities

Observability Platform Design: Architect, implement, and maintain observability systems, leveraging tools like Datadog, the LGTM stack, OpenTelemetry, and Vector to enable real-time performance monitoring, logging, and alerting.
Telemetry Optimization: Evolve and scale telemetry pipelines to ensure low latency and high availability for metrics, logs, and traces across multi-cloud environments.
Performance Engineering: Proactively identify performance bottlenecks, optimize systems, and provide recommendations for reliability improvements.
Scalable Automation: Implement automation solutions to scale systems sustainably while driving improvements in reliability and deployment velocity.
Incident Management: Collaborate with the incident response team to establish data-driven debugging and troubleshooting processes using observability data.
Tooling Development: Create and maintain self-service observability tools and dashboards to empower teams across the organization.
Cross-functional Collaboration: Partner with development, DevOps, and infrastructure teams to define SLOs/SLIs and ensure observability is embedded throughout the software lifecycle.

Other

5+ years in Site Reliability Engineering or a related role with a strong emphasis on observability.
Exceptional problem-solving and communication skills, and a data-driven approach to decision-making.
Equal Opportunity Employer statement
Commitment to ensuring hiring process accessibility for persons with disabilities.
Candidate Data Privacy statement regarding collection, use, and disclosure of personal information.