NVIDIA is looking for an ML Platform Engineer to accelerate the next era of machine learning innovation by architecting, scaling, and optimizing high-performance ML infrastructure.
Requirements
- 15+ years in software/platform engineering, including 3+ years in ML infrastructure or distributed compute systems.
- Solid understanding of the ML training and inference lifecycle, from data preprocessing to deployment.
- Proficiency in building and operating containerized workloads with Kubernetes, Docker, and workload schedulers.
- Experience with ML orchestration tools such as Kubeflow, Flyte, Airflow, or Ray.
- Strong coding skills in languages such as Python, Go, or Rust.
- Experience running Slurm or custom scheduling frameworks in production ML environments.
- Familiarity with GPU computing, Linux systems internals, and performance tuning at scale.
Responsibilities
- Design, build, and maintain scalable ML platforms and infrastructure for training and inference on large-scale, distributed GPU clusters.
- Develop internal tools and automation for ML workflow orchestration, resource scheduling, data access, and reproducibility.
- Collaborate with ML researchers and applied scientists to optimize performance and streamline end-to-end experimentation.
- Evolve and operate multi-cloud and hybrid (on-prem + cloud) environments with a focus on high availability and performance for AI workloads.
- Define and monitor ML-specific infrastructure metrics, such as model efficiency, resource utilization, job success rates, and pipeline latency.
- Build tooling to support experiment tracking, reproducibility, model versioning, and artifact management.
- Drive the adoption of modern GPU technologies (e.g., GB200, NVLink) and ensure smooth integration of next-generation hardware into ML pipelines.
Other
- BS/MS in Computer Science, Engineering, or equivalent experience.
- Participate in on-call support for platform services and infrastructure running critical ML jobs.
- Passion for building developer-centric platforms with great UX and strong operational reliability.
- Applications for this job will be accepted at least until September 22, 2025.
- NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.