NVIDIA's DGX Cloud Team is looking to optimize the efficiency and resiliency of ML workloads and develop scalable AI infrastructure tools and services to provide a stable, scalable environment for AI researchers.
Requirements
- 12+ years of hands-on experience in backend development, preferably with Python, Go, C/C++, or similar high-performance languages.
- Consistent track record of building and scaling large-scale distributed systems.
- Experience with cloud computing platforms such as AWS, Azure, and GCP, as well as container technologies like Docker and Kubernetes, and HPC/AI platforms such as Slurm.
- Real-world experience with DL frameworks and orchestrators such as PyTorch, TensorFlow, JAX, and Ray.
- Experience in developing a framework plugin architecture that allows the framework to be integrated with the cluster scheduler in a way that is visible to users.
- Strong understanding of NVIDIA GPUs, network technologies, and their failure patterns.
- Experience with AI models and AI-based tools.
Responsibilities
- Developing solutions at the intersection of machine learning, distributed systems, and high-performance computing, contributing to the advancement of AI technologies.
- Designing, developing, and optimizing (micro-)services orchestrated by Kubernetes to run large-scale AI training workflows resiliently and efficiently on AI training supercomputers located at major cloud service providers (CSPs).
- Co-designing and implementing the APIs that allow these services to integrate vertically with NVIDIA's resiliency stacks, ranging from tier-0 telemetry services to break/fix automation services to checkpoint and execution systems.
- Crafting a submission abstraction that enables model engineers and training platforms/frameworks to seamlessly submit long-running training jobs while hiding the complexity of handling infrastructure failures, managing job lifecycles with auto-restarts on failure, ensuring full efficiency, and promptly notifying users.
- Crafting these services to be modular, enabling them to be integrated with and deployed onto on-premises AI clusters that use NVIDIA hardware and cloud services.
Other
- A Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience).
- Provide references to your code contributions.
- NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer.