Senior Software Engineer, Observability

NVIDIA

$184,000 - $356,500
Dec 8, 2025
Santa Clara, CA, US

NVIDIA's Observability team is seeking a Senior/Staff Engineer to design and build the next-generation, multi-region observability platform that powers the company's rapidly expanding AI, Data, and Observability ecosystem.

Requirements

  • Deep expertise with metrics systems (Prometheus, Thanos, Mimir, Cortex), logging pipelines (Fluent Bit, Vector, Loki, ELK/OpenSearch), and tracing platforms (Jaeger, Tempo, OpenTelemetry)
  • Strong programming skills in Go or Python for automation, operators, and custom integrations
  • Experience running observability platforms on Kubernetes and operating them at scale across multi-datacenter environments
  • Demonstrated skill in crafting, optimizing, and scaling telemetry pipelines that handle high-cardinality data efficiently
  • Solid understanding of distributed systems, performance engineering, and debugging complex workloads
  • Familiarity with service meshes, networking, and workload instrumentation (Envoy, Istio, OpenTelemetry SDKs)
  • Experience with data platforms and telemetry ingestion, storage, and querying layers

Responsibilities

  • Designing and operating scalable observability systems (metrics, logging, tracing) across multi-datacenter Kubernetes environments
  • Architecting end-to-end observability pipelines, including ingestion, storage, querying, and visualization
  • Extending monitoring and alerting with Prometheus, Alertmanager, Thanos/Mimir, Grafana, and OpenTelemetry
  • Building scalable log collection and processing pipelines with Fluent Bit, Vector, Loki, or ELK/OpenSearch stacks
  • Implementing distributed tracing platforms (Tempo, Jaeger, OpenTelemetry) and integrating with service meshes, load balancers, and APIs
  • Defining and driving adoption of SLOs, SLIs, and error budgets across services and teams
  • Automating provisioning and scaling of observability infrastructure with Kubernetes, Terraform, and custom tooling (Go, Python)

Other

  • BS or MS in EE, ECE, CS, or equivalent experience
  • 8+ years of experience with distributed systems, with a focus on observability and monitoring systems
  • Collaboration skills and the ability to influence engineering teams to adopt observability guidelines
  • Ability to mentor engineers and shape NVIDIA's observability strategy and technical roadmap
  • NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer