NVIDIA is looking to identify architectural changes and completely new approaches for its GPU compute clusters to handle demanding deep learning, high-performance computing, and other computationally intensive workloads.
Requirements
- Experience with advanced AI/HPC job schedulers such as Slurm, Kubernetes (K8s), PBS, RTDA, or LSF
- Proficient in administering CentOS/RHEL and/or Ubuntu Linux distributions
- Solid understanding of cluster configuration management tools such as Ansible, Puppet, or Salt
- In-depth understanding of container technologies such as Docker, Singularity, Podman, Shifter, and Charliecloud
- Proficiency in Python programming and bash scripting
- Applied experience with AI/HPC workflows that use MPI (a minimal scheduler-submission sketch follows this list)
- Experience analyzing and tuning performance for a variety of AI/HPC workloads
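The requirements above mention job schedulers, Python, and MPI-based workflows together; as a hedged illustration of how those pieces typically interact, the sketch below submits a placeholder MPI application to Slurm from Python. The partition name, resource counts, and application path are assumptions made for the example, not details from this posting.

```python
#!/usr/bin/env python3
"""Minimal sketch: submit an MPI job to Slurm from Python.

Assumes Slurm's `sbatch` is on PATH; the partition, resource counts, and
application path below are placeholders, not values from the posting.
"""
import subprocess
import tempfile

BATCH_SCRIPT = """#!/bin/bash
#SBATCH --job-name=mpi-example
#SBATCH --partition=gpu          # placeholder partition name
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-node=4
#SBATCH --time=01:00:00

# srun launches the MPI ranks across the allocated nodes
srun ./my_mpi_app                # placeholder application path
"""

def submit() -> str:
    """Write the batch script to a temp file and submit it with sbatch."""
    with tempfile.NamedTemporaryFile("w", suffix=".sbatch", delete=False) as f:
        f.write(BATCH_SCRIPT)
        path = f.name
    # sbatch prints a line such as "Submitted batch job <id>"
    result = subprocess.run(["sbatch", path], capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print(submit())
```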
Responsibilities
- Provide leadership and strategic guidance on the management of large-scale HPC systems, including the deployment of compute, networking, and storage
- Develop and improve our ecosystem around GPU-accelerated computing, including building scalable automation solutions
- Build and maintain heterogeneous AI/ML clusters on-premises and in the cloud
- Support our researchers in running their workloads, including performance analysis and optimization
- Conduct root cause analysis and suggest corrective action
- Proactively identify and fix issues before they affect users (see the health-check sketch after this list)
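To illustrate the kind of proactive, scripted health checking this responsibility implies, here is a minimal sketch that queries nvidia-smi for per-GPU temperature and uncorrectable ECC error counts and reports suspect devices. The thresholds are example values chosen for the sketch, not operational guidance from the posting.

```python
#!/usr/bin/env python3
"""Illustrative sketch: flag GPUs with high temperature or ECC errors.

Assumes `nvidia-smi` is installed on the node; the thresholds below are
arbitrary example values, not recommendations from this posting.
"""
import subprocess

TEMP_LIMIT_C = 85        # example threshold; adjust per data-center policy
ECC_ERROR_LIMIT = 0      # any uncorrectable volatile ECC error is suspect

QUERY = "index,temperature.gpu,ecc.errors.uncorrected.volatile.total"

def check_gpus() -> list[str]:
    """Return human-readable warnings for GPUs that exceed the thresholds."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    warnings = []
    for line in out.strip().splitlines():
        index, temp, ecc = [field.strip() for field in line.split(",")]
        if temp.isdigit() and int(temp) > TEMP_LIMIT_C:
            warnings.append(f"GPU {index}: temperature {temp} C exceeds {TEMP_LIMIT_C} C")
        # ECC fields report "[N/A]" on GPUs without ECC, so check for digits first
        if ecc.isdigit() and int(ecc) > ECC_ERROR_LIMIT:
            warnings.append(f"GPU {index}: {ecc} uncorrectable ECC errors")
    return warnings

if __name__ == "__main__":
    for warning in check_gpus():
        print(warning)
```

A check like this could be scheduled across nodes via the configuration-management or monitoring tooling referenced above (for example Ansible, Puppet, or Salt).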
Other
- 5+ years of experience designing and operating large-scale compute infrastructure
- Passion for continual learning and staying ahead of emerging technologies and effective approaches in the HPC and AI/ML infrastructure fields
- Background with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking
- Experience with Machine Learning and Deep Learning concepts, algorithms and models
- Familiarity with InfiniBand, including IPoIB and RDMA