Principal Software Engineer, AIOps and Observability

NVIDIA

$248,000 - $391,000

Dec 19, 2025

Santa Clara, CA, US

NVIDIA is looking to design and develop AIOps & Observability platforms to monitor, diagnose, and optimize products, assets, and services in cloud, on-prem, data centers, supply chain, and edge.

Requirements

Strong knowledge of and experience with observability tools such as Prometheus, VictoriaMetrics, Vector, Loki, Grafana, Alertmanager, ClickHouse, and OpenTelemetry.
Hands-on knowledge of AIOps tools such as BigPanda, PagerDuty, and Datadog.
Experience with Kubernetes, Nomad, Docker, and microservices architectures, as well as with streaming services such as NATS and Kafka to ingest billions of events.
Proficient in one or more programming languages, such as Go, Python, Java, or C#.
Experience developing observability solutions to monitor on-prem and public cloud environments.
Experience running large observability platforms on bare-metal infrastructure.
Ability to establish scalable data pipelines and instrumentation for collecting, aggregating, and visualizing telemetry and operational metrics.

Responsibilities

Lead the design, development, and deployment of AIOps & Observability platforms, including metrics, logs, traces, events, alerts, dashboards, and visualizations.
Drive the technical vision and roadmap for AIOps and Observability initiatives, aligning with business goals and industry best practices.
Collaborate with other teams and customers to understand their observability needs and provide solutions that meet their requirements and expectations.
Establish and implement observability standards, guidelines, and processes across NVIDIA.
Provide peer reviews for other engineers, including feedback on performance, scalability, security, and correctness.
Work with data scientists to implement machine learning models for anomaly detection, forecasting, and root cause analysis on logs, metrics, and events.
Develop and operate scalable, reliable, and distributed systems that can handle high traffic and complex workloads.

Other

Bachelor’s degree in computer science and engineering or a related field, or equivalent experience.
15+ years of experience in product development and full-stack engineering, including 5+ years developing and operating observability platforms and solutions, preferably in a cloud-native environment.
Passionate about observability and delivering high-quality internal platforms.
Hybrid role (LI-Hybrid). Base salary will be determined based on location, experience, and the pay of employees in similar positions.