NVIDIA is looking to design and develop AIOps & Observability platforms to monitor, diagnose, and optimize products, assets, and services across cloud, on-prem, data center, supply chain, and edge environments.
Requirements
- Strong knowledge of and experience with observability tools such as Prometheus, VictoriaMetrics, Vector, Loki, Grafana, Alertmanager, ClickHouse, and OpenTelemetry.
- Hands-on experience with AIOps tools such as BigPanda, PagerDuty, and Datadog.
- Experience with Kubernetes, Nomad, Docker, and microservices architectures, as well as with streaming services such as NATS and Kafka to ingest billions of events.
- Proficiency in one or more programming languages, such as Go, Python, Java, or C#.
- Experience developing observability solutions to monitor on-prem and public cloud environments.
- Experience running large observability platforms on bare-metal infrastructure.
- Ability to establish scalable data pipelines and instrumentation for collecting, aggregating, and visualizing telemetry and operational metrics.
Responsibilities
- Lead the design, development, and deployment of AIOps & Observability platforms, including metrics, logs, traces, events, alerts, dashboards, and visualizations.
- Drive the technical vision and roadmap for AIOps and Observability initiatives, aligning with business goals and industry best practices.
- Collaborate with other teams and customers to understand their observability needs and provide solutions that meet their requirements and expectations.
- Establish and implement observability standards, guidelines, and processes across NVIDIA.
- Provide peer reviews for other engineers, including feedback on performance, scalability, security, and correctness.
- Work with data scientists to implement machine learning models for anomaly detection, forecasting, and root cause analysis on logs, metrics, and events.
- Develop and operate scalable, reliable, and distributed systems that can handle high traffic and complex workloads.
Other
- Bachelor’s degree in computer science and engineering or a related field, or equivalent experience.
- 15+ years of experience in product development and full-stack engineering, including 5+ years developing and operating observability platforms and solutions, preferably in a cloud-native environment.
- Passionate about observability and delivering high-quality internal platforms.
- Travel requirements not specified
- Clearance requirements not specified
- LI-Hybrid. Base salary will be determined based on location, experience, and the pay of employees in similar positions.