NVIDIA is seeking a software engineer to build, operate, and maintain cloud-hosted services for user and service authentication/authorization, focusing on ensuring continuity of operations, reliability, performance, and scalability.
Requirements
- Hands-on experience with modern monitoring systems (Prometheus, Grafana, Loki, Tempo, Datadog, New Relic, OpenTelemetry, etc.) within a production environment.
- Advanced coding skills in Python, Go, or similar languages for building automation and integrating observability solutions.
- Comfort with JavaScript frameworks such as React and Next.js.
- Proficiency in cloud platforms (AWS, GCP, Azure) and containerized environments (Kubernetes, Docker); experience with configuration-as-code tools (Terraform, Helm, Ansible).
- Experience with incident management and postmortem processes.
- Familiarity with the Java Spring Boot framework and hands-on experience with Apache Cassandra and HashiCorp Vault would be very advantageous.
- Having relevant coding experience and being open to supporting development would be a huge plus.
Responsibilities
- Architect, implement, and maintain observability systems at scale to enable monitoring, alerting, logging, and tracing for our cloud-based services.
- Define and refine service-level indicators (SLIs), service-level objectives (SLOs), and error budgets in partnership with service owners and product teams.
- Design, build, and maintain actionable dashboards that display key metrics, SLIs/SLOs, and system health for distributed services.
- Collaborate with software, platform, and networking teams to integrate observability at all stages of the application lifecycle, from development to incident response.
- Drive automation efforts to reduce manual toil in monitoring, telemetry, and incident response workflows; build and maintain self-service observability tooling.
- Address performance and reliability issues using root cause analysis, distributed tracing, and log correlation.
- Participate in PagerDuty rotations and contribute to post-incident reviews, detailing findings and driving solutions that improve long-term system resilience and visibility.
Other
- 8+ years in large-scale systems engineering roles with exposure to live service development, working end-to-end across service development, deployment, and observability, including on-call duty.
- Strong communication and collaboration skills, with experience working in global, cross-disciplinary teams.
- Detail-oriented, analytical problem-solving approach and high standards for operational excellence and customer happiness.
- Develop expertise in the functions and capabilities of our offerings, and assist in managing our support channels for other NVIDIA teams.