Senior System Software Engineer, Cloud Services

NVIDIA

$184,000 - $287,500
Sep 3, 2025
Santa Clara, CA, US

NVIDIA is seeking a software engineer to build, operate, and maintain cloud-hosted services for user and service authentication/authorization, focusing on ensuring continuity of operations, reliability, performance, and scalability.

Requirements

  • Hands-on experience with modern monitoring systems (Prometheus, Grafana, Loki, Tempo, Datadog, New Relic, OpenTelemetry, etc.) within a production environment.
  • Advanced coding skills in Python, Go, or similar languages for building automation and integrating observability solutions.
  • Comfort with JavaScript frameworks such as React and Next.js.
  • Proficiency in cloud platforms (AWS, GCP, Azure) and containerized environments (Kubernetes, Docker); experience with configuration-as-code tools (Terraform, Helm, Ansible).
  • Experience with incident management and postmortem processes.
  • Familiarity with the Java Spring Boot framework and hands-on experience with Apache Cassandra and HashiCorp Vault would be very advantageous.
  • Having relevant coding experience and being open to supporting development would be a huge plus.

Responsibilities

  • Architect, implement, and maintain observability systems at scale to enable monitoring, alerting, logging, and tracing for our cloud-based services.
  • Define and refine service-level indicators (SLIs), service-level objectives (SLOs), and error budgets in partnership with service owners and product teams.
  • Design, build, and maintain actionable dashboards that display key metrics, SLIs/SLOs, and system health for distributed services.
  • Collaborate with software, platform, and networking teams to integrate observability at all stages of the application lifecycle, from development to incident response.
  • Drive automation efforts to reduce manual toil in monitoring, telemetry, and incident response workflows; build and maintain self-service observability tooling.
  • Address performance and reliability issues by applying root cause analysis, distributed tracing, and log correlation.
  • Participate in Pager Duty rotations and contribute to post-incident reviews, detailing findings and driving solutions that improve long-term system resilience and visibility.

Other

  • 8+ years in large-scale systems engineering roles with exposure to live service development, working end-to-end across service development, deployment, and observability, as well as being on-call.
  • Strong communication and collaboration skills, with experience working in global, cross-disciplinary teams.
  • Detail-oriented, analytical problem-solving approach and high standards for operational excellence and customer satisfaction.
  • Develop expertise in the functions and capabilities of our offerings, and assist in managing our support channels for other NVIDIA teams.