The NVIDIA DGX Cloud organization builds and operates AI infrastructure to support AI/ML platform initiatives, enabling researchers to develop, train, and deploy AI models at global scale. The role aims to accelerate NVIDIA's research and product innovation by delivering a resilient, high-performance AI platform that integrates hardware, orchestration, and developer productivity, and that evolves to meet the growing scale and complexity of AI workloads.
Requirements
- Deep technical understanding of AI/ML workflows, job scheduling (Slurm, Kubernetes, hybrid orchestration), and large-scale distributed systems.
- Proficiency in optimizing resource usage and monitoring performance metrics in compute-intensive environments.
- Experience building platforms across cloud and on-prem hybrid architectures, integrating with internal and external MLOps stacks.
- Proficiency with observability and telemetry tools (e.g., Grafana, Prometheus) for infrastructure monitoring and performance analysis.
- Demonstrated success delivering large-scale AI/ML platform initiatives spanning workload orchestration, data pipeline integration, model training environments, and GPU fleet management.
- Proficiency with AI/ML systems, model lifecycle management, and developer tooling for large-scale training workloads.
- Deep familiarity with cloud compute and orchestration technologies, and a passion for automation and operational excellence.
Responsibilities
- Lead and scale the Technical Program Management organization responsible for the DGX Cloud AI/ML platform, enabling 1,000+ NVIDIA researchers globally.
- Drive the roadmap for end-to-end AI/ML infrastructure, spanning cluster bring-up, workload orchestration, GPU resource management, and integration with MLOps pipelines.
- Lead complex programs involving next-generation systems (e.g., GB200) and fleet-wide scaling initiatives across OCI, GCP, and other hyperscalers.
- Own platform efficiency and capacity management, applying a deep understanding of scheduling systems (e.g., Slurm, hybrid models) to optimize job placement, utilization, and turnaround time.
- Establish data-driven operational metrics (availability, occupancy, wait times, throughput) and use them to guide continuous improvement and prioritization.
- Implement governance and visibility frameworks that drive alignment, predictability, and accountability across AI platform initiatives.
- Represent DGX Cloud programs to senior leadership, clearly articulating impact, risk, and value across engineering and research organizations.
Other
- 15+ years of overall technical program management experience, including 7+ years leading and developing TPM teams in infrastructure, AI/ML, or platform engineering domains.
- Collaborate with technology and research leaders to define platform requirements, align compute strategy with AI model development, and deliver a seamless researcher experience.
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent experience).
- Track record of driving R&D productivity platforms and reducing friction for machine learning practitioners.
- Experience in new product introduction (NPI) for research and infrastructure systems.