Senior AI and ML HPC Cluster Engineer

NVIDIA

$136,000 - $264,500
Oct 18, 2025
Santa Clara, CA, US

NVIDIA is looking to identify architectural changes and/or completely new approaches for their GPU Compute Clusters to handle demanding deep learning, high performance computing, and computationally intensive workloads.

Requirements

  • Experience with AI/HPC advanced job schedulers, such as Slurm, K8s, PBS, RTDA or LSF
  • Proficient in administering CentOS/RHEL and/or Ubuntu Linux distributions
  • Solid understanding of cluster configuration management tools such as Ansible, Puppet, or Salt
  • In-depth understanding of container technologies such as Docker, Singularity, Podman, Shifter, and Charliecloud
  • Proficiency in Python programming and bash scripting
  • Applied experience with AI/HPC workflows that use MPI
  • Experience analyzing and tuning performance for a variety of AI/HPC workloads

Responsibilities

  • Provide leadership and strategic guidance on the management of large-scale HPC systems, including the deployment of compute, networking, and storage
  • Develop and improve our ecosystem around GPU-accelerated computing, including scalable automation solutions
  • Build and maintain AI and ML heterogeneous clusters on-premises and in the cloud
  • Support our researchers in running their workloads, including performance analysis and optimization
  • Conduct root cause analysis and suggest corrective action
  • Proactively find and fix issues before they occur

Other

  • 5+ years of experience designing and operating large-scale compute infrastructure
  • Passion for continual learning and staying ahead of emerging technologies and effective approaches in the HPC and AI/ML infrastructure fields
  • Background with NVIDIA GPUs, CUDA Programming, NCCL and MLPerf benchmarking
  • Experience with Machine Learning and Deep Learning concepts, algorithms and models
  • Familiarity with InfiniBand with IBOP and RDMA