AI and ML HPC Cluster Engineer

NVIDIA

$120,000 - $189,750
Jan 2, 2026
Santa Clara, CA, US

NVIDIA is looking to solve the problem of managing large-scale HPC systems for AI/ML workloads, including deployment of compute, networking, and storage, to enable researchers and engineers to develop the next generation of AI/ML systems.

Requirements

  • Background in managing AI/HPC job schedulers like Slurm, K8s, PBS, RTDA, BCM (formerly known as Bright), or LSF
  • Proficient in administering CentOS/RHEL and/or Ubuntu Linux distributions
  • Proven understanding of cluster configuration management tools (Ansible, Puppet, Salt, etc.)
  • Experience with container technologies (Docker, Singularity, Podman, Shifter, Charliecloud)
  • Python programming and Bash scripting
  • Background with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking
  • Experience with AI/ML concepts, algorithms, models, and frameworks (PyTorch, TensorFlow)

Responsibilities

  • Support day-to-day operations of production on-premises and multi-cloud AI/HPC clusters, ensuring system health, user satisfaction, and efficient resource utilization.
  • Directly administer internal research clusters, conduct upgrades, incident response, and reliability improvements.
  • Develop and improve our ecosystem around GPU-accelerated computing including developing scalable automation solutions.
  • Maintain heterogeneous AI/ML clusters on-premises and in the cloud.
  • Support our researchers in running their workloads, including performance analysis and optimization.
  • Analyze and optimize cluster efficiency, job fragmentation, and GPU waste to meet internal SLA targets.
  • Support root cause analysis and suggest corrective action.

Other

  • Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
  • Minimum 2 years of experience administering multi-node compute infrastructure
  • Passion for continual learning and staying ahead of emerging technologies and effective approaches in the HPC and AI/ML infrastructure fields
  • Participate in a shared on-call rotation
  • Travel requirements not specified