TensorWave is building a versatile cloud platform for AI compute and needs to manage distributed machine learning workloads at scale using Slurm and Kubernetes.
Requirements
- Significant hands-on experience with Slurm in production HPC/ML environments, including setup and configuration, container runtimes (Enroot with the Pyxis plugin), environment modules, and MPI.
- Strong knowledge of the languages and frameworks used in distributed ML, such as Python, PyTorch, Megatron-LM, c10d, and MPI.
- Understanding of the node lifecycle, including health checks, prolog/epilog scripts, and draining.
- Deep understanding of security, compliance, and resilience in containerized workloads.
- 3+ years of hands-on Kubernetes experience, including deep knowledge of the Kubernetes API, internals, networking, and storage.
- Proficiency in writing Kubernetes manifests, Helm charts, and managing releases.
- Experience with DAGs using K8s native tools such as Argo Workflows.
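For context on the node-lifecycle requirement above, here is a minimal sketch of the kind of health-check logic a Slurm prolog/epilog script might run before deciding to drain a node. All names, fields, and thresholds are illustrative assumptions, not TensorWave conventions.

```python
# Hypothetical node health-check logic of the kind a Slurm prolog/epilog
# script might run; fields and thresholds are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class NodeHealth:
    expected_gpus: int
    visible_gpus: int
    ecc_errors: int
    ib_links_up: int
    ib_links_expected: int

def drain_reason(h: NodeHealth) -> Optional[str]:
    """Return a reason string if the node should be drained, else None."""
    if h.visible_gpus < h.expected_gpus:
        return f"gpu_count: {h.visible_gpus}/{h.expected_gpus}"
    if h.ecc_errors > 0:
        return f"ecc_errors: {h.ecc_errors}"
    if h.ib_links_up < h.ib_links_expected:
        return f"ib_links: {h.ib_links_up}/{h.ib_links_expected}"
    return None

# A real epilog would shell out to something like
#   scontrol update nodename=<node> state=drain reason=<reason>
# when drain_reason returns a value.
```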
Responsibilities
- Manage and iterate on our containerized Slurm (Slurm-in-Kubernetes) solution, including customer configuration and deployment.
- Work closely with our engineering team to develop and maintain CI and automation for managed offerings.
- Ensure healthy cluster operations and uptime by implementing active and passive health checks, including automated node draining and triage.
- Help profile and debug distributed workloads, from small inference jobs to cluster-wide training.
- Establish best practices for running jobs at scale, including monitoring and checkpointing.
- Mentor and upskill ML engineers in best practices.
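The checkpointing responsibility above can be sketched as a resume-from-latest pattern; the paths, file format, and function names below are hypothetical illustrations of the idea, not a specific implementation.

```python
# Illustrative resume-from-latest-checkpoint pattern for long-running
# training jobs; paths and naming scheme are hypothetical.
import json
from pathlib import Path
from typing import Optional

def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write a checkpoint atomically so a crash never leaves a partial file."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    path = ckpt_dir / f"step_{step:08d}.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"step": step, "state": state}))
    tmp.rename(path)  # atomic rename: readers see either old or new, never partial
    return path

def latest_checkpoint(ckpt_dir: Path) -> Optional[dict]:
    """Return the most recent checkpoint, or None if the job is starting fresh."""
    ckpts = sorted(ckpt_dir.glob("step_*.json"))  # zero-padding keeps sort order correct
    if not ckpts:
        return None
    return json.loads(ckpts[-1].read_text())
```

The same shape applies whether the state is a small JSON dict or a multi-gigabyte PyTorch state dict: write to a temporary name, rename into place, and resume from the newest complete file.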
Other
- Senior-level role
- Technical visionary
- Hands-on expert
- Make GPUs go Brrrrrrr