Enhancing efficiency for NVIDIA's researchers by driving improvements across the entire stack, identifying and resolving infrastructure gaps, and enabling groundbreaking AI and ML research on GPU clusters.
Requirements
- Hands-on experience using or operating High Performance Computing (HPC)-grade infrastructure, along with in-depth knowledge of accelerated computing (e.g., GPU, custom silicon), storage (e.g., Lustre, GPFS, BeeGFS), scheduling and orchestration (e.g., Slurm, Kubernetes, LSF), high-speed networking (e.g., InfiniBand, RoCE, Amazon EFA), and container technologies (e.g., Docker, Enroot).
- Experience managing and optimizing large-scale distributed training jobs using PyTorch (DDP, FSDP), NeMo, or JAX, along with a deep understanding of AI/ML workflows, including data processing, model training, and inference pipelines.
- Proficiency in programming and scripting languages such as Python, Go, and Bash; familiarity with cloud computing platforms (e.g., AWS, GCP, Azure); and experience with parallel computing frameworks and paradigms.
- 15+ years of demonstrated expertise in AI/ML and HPC tasks and systems.
- Dedication to ongoing learning and staying updated on new technologies and innovative methods in the AI/ML infrastructure sector.
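As context for the distributed-training and scheduling requirements above, a minimal sketch of launching a multi-node PyTorch DDP job under Slurm might look like the following. The node and GPU counts, the rendezvous port, and the `train.py` entry point are illustrative assumptions, not a prescribed setup:

```shell
#!/bin/bash
#SBATCH --job-name=ddp-train        # illustrative job name
#SBATCH --nodes=2                   # assumed: two nodes, 8 GPUs each
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1         # one torchrun launcher per node

# torchrun spawns one worker process per GPU and wires up
# torch.distributed via the rendezvous endpoint below. train.py is a
# hypothetical entry point that calls init_process_group() itself.
srun torchrun \
    --nnodes="$SLURM_NNODES" \
    --nproc-per-node=8 \
    --rdzv-backend=c10d \
    --rdzv-endpoint="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1):29500" \
    train.py
```

In practice the batch script would also load the site's container or module environment (e.g., Enroot images) before the `srun` line.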
Responsibilities
- Engage closely with our AI and ML research teams to understand their infrastructure requirements and pain points, translating those insights into actionable improvements.
- Proactively identify researcher-efficiency bottlenecks and lead initiatives to systematically eliminate them; set the direction and long-term roadmaps for these initiatives.
- Monitor and optimize the performance of our infrastructure, ensuring high availability, scalability, and efficient resource utilization.
- Help define and refine key metrics of AI researcher efficiency, ensuring that our work is aligned with measurable outcomes.
- Work closely with a variety of teams, such as researchers, data engineers, and DevOps professionals, to develop a cohesive AI/ML infrastructure ecosystem.
- Keep up to date with the latest developments in AI/ML technologies, frameworks, and best practices, and advocate for their adoption within the organization.
Other
- Excellent communication and collaboration skills, with the ability to work effectively with teams and individuals of different backgrounds.
- If you're a passionate and independent engineer with a love for technology, we want to hear from you.