NVIDIA is building the best cloud offering for AI workloads, bringing its latest GPU technology to clients as managed services under the DGX Cloud umbrella. Doing so requires scalable, managed, self-service APIs that make NVIDIA products easy to access.
Requirements
- Solid technical foundation in distributed computing and storage, including substantial experience with all of the following: server systems, storage, I/O, networking, and system software
- 12+ years of platform engineering experience on large-scale production systems
- Hands-on engineering expertise with Kubernetes and infrastructure as code (IaC)
- Working knowledge of shared storage systems such as NFS, Lustre, and GlusterFS
- Familiarity with system-level architecture, including interconnects, memory hierarchy, interrupts, and memory-mapped I/O
- Proven experience in high-performance computing, deep learning, and/or GPU-accelerated computing
- Experience running large-scale distributed systems, HPC, and ML training workloads on Slurm and Kubernetes
Responsibilities
- As part of the service team, design and build platforms for DGX Cloud services
- Identify the best of both HPC and Kubernetes and help us build a unified platform
- Work with a team of software engineers and product managers, as well as engineering teams across NVIDIA, on DGX Cloud AI Compute services
- Write IaC, work on Kubernetes, and help the team design and implement release pipelines
- Collaborate with the team to determine how to make the best use of GitOps and pipelines
Other
- BS in Computer Science, Information Systems, Computer Engineering, or equivalent experience
- Ability to understand and communicate complex designs, distributed infrastructure, and requirements to peers, customers, and vendors
- Deep knowledge of both the software and hardware in HPC and ML infrastructure
- Applications for this job will be accepted at least until September 22, 2025.
- NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.