ML Cluster Operations Engineer

TensorWave

Salary not specified
Nov 17, 2025
Las Vegas, NV, US

TensorWave is building a versatile cloud platform for AI compute and needs to manage distributed machine learning workloads at scale using Slurm and Kubernetes.

Requirements

  • Significant hands-on experience with Slurm in production HPC/ML environments, including an understanding of setup and configuration, Enroot (Pyxis), modules, and MPI.
  • Strong knowledge of distributed ML languages and frameworks, such as Python, PyTorch, Megatron, c10d, and MPI.
  • Understanding of node lifecycle, including health checks, prolog / epilog scripts, and draining.
  • Deep understanding of security, compliance, and resilience in containerized workloads.
  • 3+ years of hands-on Kubernetes experience, including deep knowledge of the Kubernetes API, internals, networking, and storage.
  • Proficiency in writing Kubernetes manifests, Helm charts, and managing releases.
  • Experience building DAG-based workflows with Kubernetes-native tools such as Argo Workflows.

Responsibilities

  • Manage and iterate on our containerized Slurm (Slurm-in-Kubernetes) solution, including customer configuration and deployment.
  • Work closely with our engineering team to develop and maintain CI and automation for managed offerings.
  • Ensure healthy cluster operations and uptime by implementing active and passive health checks, including automated node draining and triage.
  • Help profile and debug distributed workloads, from small inference jobs to cluster-wide training.
  • Establish best practices for running jobs at scale, such as monitoring and checkpointing.
  • Mentor and upskill ML engineers in best practices.

Other

  • Senior-level role
  • Technical visionary
  • Hands-on expert
  • Make GPUs go Brrrrrrr