Principal Software Engineer, AIOps and Observability

NVIDIA

$248,000 - $391,000
Dec 19, 2025
Santa Clara, CA, US

NVIDIA is looking for a principal engineer to design and develop AIOps & Observability platforms that monitor, diagnose, and optimize products, assets, and services across cloud, on-prem, data center, supply chain, and edge environments.

Requirements

  • Strong knowledge of and experience with observability tools such as Prometheus, VictoriaMetrics, Vector, Loki, Grafana, Alertmanager, ClickHouse, OpenTelemetry, etc.
  • Hands-on experience with AIOps tools such as BigPanda, PagerDuty, Datadog, etc.
  • Experience with Kubernetes, Nomad, Docker, and microservices architectures, as well as with streaming services that ingest billions of events using NATS, Kafka, etc.
  • Proficiency in one or more programming languages, such as Go, Python, Java, C#, etc.
  • Experience developing observability solutions to monitor on-prem and public cloud environments.
  • Experience running large observability platforms on bare-metal infrastructure.
  • Ability to establish scalable data pipelines and instrumentation for collecting, aggregating, and visualizing telemetry and operational metrics.

Responsibilities

  • Lead the design, development, and deployment of AIOps & Observability platforms, including metrics, logs, traces, events, alerts, dashboards, and visualizations.
  • Drive the technical vision and roadmap for AIOps and Observability initiatives, aligning with business goals and industry best practices.
  • Collaborate with other teams and customers to understand their observability needs and provide solutions that meet their requirements and expectations.
  • Establish and implement observability standards, guidelines, and processes across NVIDIA.
  • Provide peer reviews for other engineers, including feedback on performance, scalability, security, and correctness.
  • Work with data scientists to implement machine learning models for anomaly detection, forecasting, and root cause analysis on logs, metrics, and events.
  • Develop and operate scalable, reliable, and distributed systems that can handle high traffic and complex workloads.

Other

  • Bachelor’s degree in computer science and engineering or a related field, or equivalent experience.
  • 15+ years of experience in product development and full stack engineering, with 5+ years of experience in developing and operating observability platforms and solutions, preferably in a cloud-native environment.
  • Passionate about observability and delivering high-quality internal platforms.
  • Travel requirements not specified
  • Clearance requirements not specified
  • LI-Hybrid. Base salary will be determined based on location, experience, and the pay of employees in similar positions.