NVIDIA is solving the problem of managing large-scale HPC systems for AI/ML workloads, including the deployment of compute, networking, and storage, so that researchers and engineers can develop the next generation of AI/ML systems.
Requirements
- Background in managing AI/HPC job schedulers and cluster managers such as Slurm, Kubernetes (K8s), PBS, RTDA, LSF, or BCM (formerly Bright Cluster Manager)
- Proficiency in administering CentOS/RHEL and/or Ubuntu Linux distributions
- Proven understanding of cluster configuration management tools (Ansible, Puppet, Salt, etc.)
- Experience with container technologies (Docker, Singularity, Podman, Shifter, Charliecloud)
- Python programming and Bash scripting skills (see the illustrative sketch after this list)
- Background with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking
- Experience with AI/ML concepts, algorithms, models, and frameworks (PyTorch, TensorFlow)
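The requirements above combine scheduler administration, Linux, and scripting. Below is a minimal illustrative sketch of how those skills might come together day to day: a Python script that flags unhealthy GPU nodes via Slurm's sinfo. It assumes a Slurm cluster with GPUs exposed as GRES; the state list, format string, and the script itself are assumptions for illustration, not part of the role.

```python
#!/usr/bin/env python3
"""Minimal sketch: flag GPU nodes that look unhealthy, using Slurm's sinfo.

Assumes a Slurm cluster with GPUs exposed as GRES; state names and the
format string may differ per site.
"""
import subprocess

# %N = node name, %t = compact node state, %G = generic resources (GRES)
SINFO_CMD = ["sinfo", "--Node", "--noheader", "--format=%N %t %G"]

# Compact Slurm states that usually indicate a node needs attention
UNHEALTHY_STATES = {"down", "drain", "drng", "fail", "maint"}

def unhealthy_gpu_nodes():
    """Return (node, state) pairs for GPU nodes in an unhealthy state."""
    out = subprocess.run(SINFO_CMD, capture_output=True, text=True, check=True)
    flagged = []
    for line in out.stdout.splitlines():
        node, state, gres = (line.split() + ["", "", ""])[:3]
        # Keep only GPU nodes; strip Slurm's trailing state flags such as "*" or "~"
        if "gpu" in gres and state.rstrip("*~#!%$@^-") in UNHEALTHY_STATES:
            flagged.append((node, state))
    return flagged

if __name__ == "__main__":
    for node, state in unhealthy_gpu_nodes():
        print(f"{node}: {state}")
```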
Responsibilities
- Support day-to-day operations of production on-premises and multi-cloud AI/HPC clusters, ensuring system health, user satisfaction, and efficient resource utilization.
- Directly administer internal research clusters, including upgrades, incident response, and reliability improvements.
- Develop and improve our ecosystem around GPU-accelerated computing, including building scalable automation solutions.
- Maintain heterogeneous AI/ML clusters on-premises and in the cloud.
- Support our researchers in running their workloads, including performance analysis and optimization.
- Analyze and optimize cluster efficiency, job fragmentation, and GPU waste to meet internal SLA targets (see the sketch after this list).
- Support root cause analysis and suggest corrective action.
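To make the efficiency and GPU-waste work above concrete, here is a minimal sketch of the kind of per-node utilization check this role might automate. It assumes nvidia-smi is available on the node; the 10% threshold and the equation of low utilization with waste are illustrative assumptions, not internal SLA definitions.

```python
#!/usr/bin/env python3
"""Minimal sketch: report GPUs on the local node whose utilization suggests waste.

Assumes nvidia-smi is installed; the 10% cutoff is an illustrative choice,
not an internal SLA.
"""
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"]

UTIL_THRESHOLD_PCT = 10  # hypothetical cutoff for "likely idle"

def low_utilization_gpus():
    """Return (gpu_index, util_pct, mem_used_mib) for GPUs under the threshold."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    idle = []
    for line in out.stdout.strip().splitlines():
        index, util, mem = [field.strip() for field in line.split(",")]
        if int(util) < UTIL_THRESHOLD_PCT:
            idle.append((int(index), int(util), int(mem)))
    return idle

if __name__ == "__main__":
    for index, util, mem in low_utilization_gpus():
        print(f"GPU {index}: {util}% utilization, {mem} MiB used")
```

In practice, per-node data like this would be aggregated across the cluster and correlated with scheduler accounting (for example, Slurm's sacct) before feeding efficiency reports.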
Other
- Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
- Minimum 2 years of experience administering multi-node compute infrastructure
- Passion for continual learning and for staying ahead of emerging technologies and effective approaches in HPC and AI/ML infrastructure
- Willingness to participate in a shared on-call rotation