Senior System Software Engineer, Cloud Services

NVIDIA

$184,000 - $287,500

Sep 3, 2025

Santa Clara, CA, US

NVIDIA is seeking a software engineer to build, operate, and maintain cloud-hosted services for user and service authentication/authorization, focusing on ensuring continuity of operations, reliability, performance, and scalability.

Requirements

Hands-on experience with modern monitoring systems (Prometheus, Grafana, Loki, Tempo, Datadog, New Relic, OpenTelemetry, etc.) within a production environment.
Advanced coding skills in Python, Go, or similar languages for building automation and integrating observability solutions.
Comfort with JavaScript frameworks such as React and Next.js.
Proficiency in cloud platforms (AWS, GCP, Azure) and containerized environments (Kubernetes, Docker); experience with configuration-as-code tools (Terraform, Helm, Ansible).
Experience with incident management and postmortem processes.
Familiarity with the Java Spring Boot framework and hands-on experience with Apache Cassandra and HashiCorp Vault would be very advantageous.
Having relevant coding experience and being open to supporting development would be a huge plus.

Responsibilities

Architect, implement, and maintain observability systems at scale to enable monitoring, alerting, logging, and tracing for our cloud-based services.
Define and refine service-level indicators (SLIs), service-level objectives (SLOs), and error budgets in partnership with service owners and product teams.
Design, build, and maintain actionable dashboards that display key metrics, SLIs/SLOs, and system health for distributed services.
Collaborate with software, platform, and networking teams to integrate observability at all stages of the application lifecycle, from development to incident response.
Drive automation efforts to reduce manual toil in monitoring, telemetry, and incident response workflows; build and maintain self-service observability tooling.
Address performance and reliability issues by applying root cause analysis, distributed tracing, and log correlation.
Participate in Pager Duty rotations and contribute to post-incident reviews, detailing findings and driving solutions that improve long-term system resilience and visibility.

Other

8+ years in large-scale systems engineering roles with exposure to live services, working end-to-end across service development, deployment, and observability, as well as being on call.
Strong communication and collaboration skills, with experience working in global, cross-disciplinary teams.
Detail-oriented, analytical problem-solving approach and high standards for operational excellence and customer happiness.
Develop expertise in the functions and capabilities of our offerings, and assist in managing our support channels for other NVIDIA teams.