Senior ML Platform Engineer - Lepton

NVIDIA

$184,000 - $356,500
Dec 3, 2025
Santa Clara, CA, US

NVIDIA is looking to accelerate the next era of machine learning innovation by building and scaling high-performance ML infrastructure.

Requirements

  • Strong proficiency in Infrastructure-as-Code (IaC) tools, specifically Ansible and Terraform, with a proven track record of building and managing production infrastructure.
  • SRE-oriented mindset with extensive experience in diagnosing system-level issues, performance tuning, and ensuring platform reliability.
  • Solid understanding of ML workflows and lifecycle—from data preprocessing to deployment.
  • Proficiency in operating containerized workloads with Kubernetes and Docker.
  • Strong software engineering skills in languages such as Python or Go, with a focus on automation, tooling, and writing production-grade code.
  • Experience with Linux systems internals, networking, and performance tuning at scale.

Responsibilities

  • Design, build, and maintain our core ML platform infrastructure as code, primarily using Ansible and Terraform, ensuring reproducibility and scalability across large-scale, distributed GPU clusters.
  • Apply SRE principles to diagnose, troubleshoot, and resolve complex system issues across the entire stack, ensuring high availability and performance for critical AI workloads.
  • Develop robust internal automation and tooling for ML workflow orchestration, resource scheduling, and platform operations, with a strong focus on software engineering best practices.
  • Collaborate with ML researchers and applied scientists to understand infrastructure needs and build solutions that streamline their end-to-end experimentation.
  • Evolve and operate our multi-cloud and hybrid (on-prem + cloud) environments, implementing monitoring, alerting, and incident response protocols.
  • Participate in on-call rotation to provide support for platform services and infrastructure running critical ML jobs, driving root cause analysis and implementing preventative measures.
  • Write high-quality, maintainable code (Python, Go) to contribute to the core orchestration platform and automate manual processes.

Other

  • BS/MS in Computer Science, Engineering, or equivalent experience.
  • 8+ years in software/platform engineering or SRE roles, including 3+ years focused on ML infrastructure or distributed compute systems.
  • Travel requirements not specified
  • Must be eligible to work in the country where the job is located
  • NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.