Director, Technical Program Management - AI and ML Platforms

NVIDIA

$264,000 - $402,500
Oct 30, 2025
Santa Clara, CA, US

The NVIDIA DGX Cloud organization needs to build and operate AI infrastructure to support AI/ML platform initiatives, enabling researchers to develop, train, and deploy AI models on a global scale. The role aims to accelerate NVIDIA's research and product innovation by delivering a resilient, high-performance AI platform that integrates hardware, orchestration, and developer productivity, while evolving to meet the scale and complexity of AI workloads.

Requirements

  • Deep technical understanding of AI/ML workflows, job scheduling (Slurm, Kubernetes, hybrid orchestration), and large-scale distributed systems.
  • Proficiency in optimizing resource usage and monitoring performance metrics in compute-heavy settings.
  • Experience building platforms across cloud and on-prem hybrid architectures, integrating with internal and external MLOps stacks.
  • Proficiency with observability and telemetry tools (e.g., Grafana, Prometheus) for infrastructure monitoring and performance analysis.
  • Demonstrated success in implementing AI and machine learning systems and platform initiatives at large scale, encompassing workload orchestration, data pipeline integration, model training environments, and GPU fleet management.
  • Proficiency in AI/ML systems, model lifecycle management, and developer tools for large-scale training tasks.
  • Deep familiarity with cloud compute and orchestration technologies, and a passion for automation and operational excellence.

Responsibilities

  • Lead and scale the Technical Program Management organization responsible for the DGX Cloud AI/ML platform, enabling 1,000+ NVIDIA researchers globally.
  • Drive the roadmap for end-to-end AI/ML infrastructure, spanning cluster bring-up, workload orchestration, GPU resource management, and integration with MLOps pipelines.
  • Lead complex programs involving next-generation systems (e.g., GB200) and fleet-wide scaling initiatives across OCI, GCP, and other hyperscalers.
  • Own platform efficiency and capacity management, using a deep understanding of scheduling systems (e.g., Slurm, hybrid models) to optimize job placement, utilization, and turnaround time.
  • Establish data-driven operational metrics (availability, occupancy, wait times, throughput) and use them to guide continuous improvement and prioritization.
  • Implement governance and visibility frameworks that drive alignment, predictability, and accountability across AI platform initiatives.
  • Represent DGX Cloud programs to senior leadership, clearly articulating impact, risk, and value across engineering and research organizations.

Other

  • 15+ years of overall technical program management experience, including 7+ years leading and developing TPM teams in infrastructure, AI/ML, or platform engineering domains.
  • Collaborate with technology and innovation leaders to define platform needs, align compute strategy with AI model development, and provide a seamless researcher experience.
  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • Track record driving R&D productivity platforms and reducing friction for machine learning practitioners.
  • Experience in new product introduction (NPI) for research and infrastructure systems.