Rivian and Volkswagen Group Technologies is working to solve the challenges of automotive's next chapter by developing technology for software-defined vehicles, with a focus on operating systems, zonal controllers, and cloud/connectivity solutions. The Senior SRE role is crucial for ensuring the health, performance, and reliability of the company's production environment through robust observability systems.
Requirements
- Proficiency in designing and operating observability platforms with tools like Prometheus, Grafana, Loki, Jaeger, or Datadog.
- Experience with OpenTelemetry and distributed tracing in microservices architectures.
- Deep knowledge of Kubernetes (e.g., EKS), ArgoCD, and Crossplane.
- Strong proficiency in Python, Go, or similar languages for building automation and custom telemetry solutions.
- Familiarity with multi-cloud setups, containerization (Docker), and Linux system fundamentals.
Responsibilities
- Observability Platform Design: Architect, implement, and maintain observability systems, leveraging tools like Datadog, the LGTM stack (Loki, Grafana, Tempo, Mimir), OpenTelemetry, and Vector to enable real-time performance monitoring, logging, and alerting.
- Telemetry Optimization: Evolve and scale telemetry pipelines to ensure low latency and high availability for metrics, logs, and traces across multi-cloud environments.
- Performance Engineering: Proactively identify performance bottlenecks, optimize systems, and provide recommendations for reliability improvements.
- Scalable Automation: Implement automation solutions to scale systems sustainably while driving improvements in reliability and deployment velocity.
- Incident Management: Collaborate with the incident response team to establish data-driven debugging and troubleshooting processes using observability data.
- Tooling Development: Create and maintain self-service observability tools and dashboards to empower teams across the organization.
- Cross-functional Collaboration: Partner with development, DevOps, and infrastructure teams to define SLOs/SLIs and ensure observability is embedded throughout the software lifecycle.
Other
- 5+ years of experience in Site Reliability Engineering or a related role, with a strong emphasis on observability.
- Exceptional problem-solving and communication skills, and a data-driven approach to decision-making.
- Equal Opportunity Employer statement
- Commitment to ensuring hiring process accessibility for persons with disabilities.
- Candidate Data Privacy statement regarding collection, use, and disclosure of personal information.