NVIDIA is solving the problem of managing large-scale HPC systems for AI/ML workloads, including the deployment of compute, networking, and storage, so that researchers and engineers can develop the next generation of AI/ML systems.
Requirements
- Background in managing AI/HPC job schedulers and cluster managers such as Slurm, Kubernetes (K8s), PBS, RTDA, LSF, or BCM (formerly Bright Cluster Manager)
- Proficiency in administering CentOS/RHEL and/or Ubuntu Linux distributions
- Proven understanding of cluster configuration management tools (Ansible, Puppet, Salt, etc.)
- Experience with container technologies (Docker, Singularity, Podman, Shifter, Charliecloud)
- Python programming and Bash scripting skills (see the illustrative sketch after this list)
- Background with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking
- Experience with AI/ML concepts, algorithms, models, and frameworks (PyTorch, TensorFlow)
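The requirements above combine scheduler administration, Linux, and scripting. Below is a minimal illustrative sketch of how those skills might come together day to day: a Python script that flags unhealthy GPU nodes via Slurm's sinfo. It assumes a Slurm cluster with GPUs exposed as GRES; the state list, format string, and the script itself are assumptions for illustration, not part of the role.

```python
#!/usr/bin/env python3
"""Minimal sketch: flag GPU nodes that look unhealthy, using Slurm's sinfo.

Assumes a Slurm cluster with GPUs exposed as GRES; state names and the
format string may differ per site.
"""
import subprocess

# %N = node name, %t = compact node state, %G = generic resources (GRES)
SINFO_CMD = ["sinfo", "--Node", "--noheader", "--format=%N %t %G"]

# Compact Slurm states that usually indicate a node needs attention
UNHEALTHY_STATES = {"down", "drain", "drng", "fail", "maint"}

def unhealthy_gpu_nodes():
    """Return (node, state) pairs for GPU nodes in an unhealthy state."""
    out = subprocess.run(SINFO_CMD, capture_output=True, text=True, check=True)
    flagged = []
    for line in out.stdout.splitlines():
        node, state, gres = (line.split() + ["", "", ""])[:3]
        # Keep only GPU nodes; strip Slurm's trailing state flags such as "*" or "~"
        if "gpu" in gres and state.rstrip("*~#!%$@^-") in UNHEALTHY_STATES:
            flagged.append((node, state))
    return flagged

if __name__ == "__main__":
    for node, state in unhealthy_gpu_nodes():
        print(f"{node}: {state}")
```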
Responsibilities
- Support day-to-day operations of production on-premises and multi-cloud AI/HPC clusters, ensuring system health, user satisfaction, and efficient resource utilization.
- Directly administer internal research clusters, including upgrades, incident response, and reliability improvements.
- Develop and improve our ecosystem around GPU-accelerated computing, including building scalable automation solutions.
- Maintain heterogeneous AI/ML clusters on-premises and in the cloud.
- Support our researchers in running their workloads, including performance analysis and optimization.
- Analyze and optimize cluster efficiency, job fragmentation, and GPU waste to meet internal SLA targets (see the sketch after this list).
- Support root cause analysis and suggest corrective action.
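To make the efficiency and GPU-waste work above concrete, here is a minimal sketch of the kind of per-node utilization check this role might automate. It assumes nvidia-smi is available on the node; the 10% threshold and the equation of low utilization with waste are illustrative assumptions, not internal SLA definitions.

```python
#!/usr/bin/env python3
"""Minimal sketch: report GPUs on the local node whose utilization suggests waste.

Assumes nvidia-smi is installed; the 10% cutoff is an illustrative choice,
not an internal SLA.
"""
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"]

UTIL_THRESHOLD_PCT = 10  # hypothetical cutoff for "likely idle"

def low_utilization_gpus():
    """Return (gpu_index, util_pct, mem_used_mib) for GPUs under the threshold."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    idle = []
    for line in out.stdout.strip().splitlines():
        index, util, mem = [field.strip() for field in line.split(",")]
        if int(util) < UTIL_THRESHOLD_PCT:
            idle.append((int(index), int(util), int(mem)))
    return idle

if __name__ == "__main__":
    for index, util, mem in low_utilization_gpus():
        print(f"GPU {index}: {util}% utilization, {mem} MiB used")
```

In practice, per-node data like this would be aggregated across the cluster and correlated with scheduler accounting (for example, Slurm's sacct) before feeding efficiency reports.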
Other
- Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
- Minimum 2 years of experience administering multi-node compute infrastructure
- Passion for continual learning and for staying ahead of emerging technologies and effective approaches in HPC and AI/ML infrastructure
- Willingness to participate in a shared on-call rotation