Principal ML Platform Engineer

NVIDIA

$272,000 - $425,500
Sep 18, 2025
Santa Clara, CA, US

NVIDIA is looking for a Principal ML Platform Engineer to accelerate the next era of machine learning innovation by architecting, scaling, and optimizing high-performance ML infrastructure.

Requirements

  • 15+ years in software/platform engineering, including 3+ years in ML infrastructure or distributed compute systems.
  • Solid understanding of ML training and inference workflows across the full lifecycle, from data preprocessing to deployment.
  • Proficiency in crafting and operating containerized workloads with Kubernetes, Docker, and workload schedulers.
  • Experience with ML orchestration tools such as Kubeflow, Flyte, Airflow, or Ray.
  • Strong coding skills in languages such as Python, Go, or Rust.
  • Experience running Slurm or custom scheduling frameworks in production ML environments.
  • Familiarity with GPU computing, Linux systems internals, and performance tuning at scale.

Responsibilities

  • Design, build, and maintain scalable ML platforms and infrastructure for training and inference on large-scale, distributed GPU clusters.
  • Develop internal tools and automation for ML workflow orchestration, resource scheduling, data access, and reproducibility.
  • Collaborate with ML researchers and applied scientists to optimize performance and streamline end-to-end experimentation.
  • Evolve and operate multi-cloud and hybrid (on-prem + cloud) environments with a focus on high availability and performance for AI workloads.
  • Define and monitor ML-specific infrastructure metrics, such as model efficiency, resource utilization, job success rates, and pipeline latency.
  • Build tooling to support experimentation tracking, reproducibility, model versioning, and artifact management.
  • Drive the adoption of modern GPU technologies and ensure smooth integration of next-generation hardware (e.g., GB200, NVLink) into ML pipelines.

Other

  • BS/MS in Computer Science, Engineering, or equivalent experience.
  • Participate in on-call support for platform services and infrastructure running critical ML jobs.
  • Passion for building developer-centric platforms with great UX and strong operational reliability.
  • Applications for this job will be accepted at least until September 22, 2025.
  • NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.