Staff Software Engineer, Slurm

CRUSOE

$185,000 - $224,000
Sep 29, 2025
San Francisco, CA, USA

Crusoe's mission is to accelerate the abundance of energy and intelligence by crafting the engine that powers a world where people can create ambitiously with AI — without sacrificing scale, speed, or sustainability. The company is seeking to build and operate Slurm as a fully managed cloud service within Crusoe Cloud to deliver next-generation orchestration capabilities for GPU-accelerated and high-performance computing (HPC) at scale.

Requirements

  • 7+ years of software engineering experience, with a strong background in systems engineering. Experience in distributed systems, cloud, or HPC environments is a must.
  • 2+ years of programming experience in Go. Strong proficiency in other systems languages (Rust, C++, Python for HPC tooling) is also beneficial.
  • Extensive experience with Kubernetes and with Linux engineering and debugging.
  • Deep knowledge of Slurm (Simple Linux Utility for Resource Management) administration and the architecture required for managing compute jobs in high-performance environments.
  • Skilled in infrastructure as code and familiar with systems-level challenges, ideally with experience utilizing Terraform.
  • Understanding of Argo, CI/CD, and automated testing pipelines. Able to design and take ownership of system architecture, including CI/CD pipelines, while ensuring adherence to security standards.
  • Strong knowledge of container networking (CNI plugins, service meshes) and Linux networking fundamentals.

Responsibilities

  • Lead the development and engineering of our managed Slurm offering, providing a seamless experience for AI/ML and HPC customers who rely on robust Slurm job scheduling.
  • Contribute to the development of scalable and robust software solutions, closely aligning with the strategic objectives outlined in the Crusoe Cloud roadmap.
  • Design, build, and maintain Kubernetes operators and controllers dedicated to managing the lifecycle, configuration, and state of large-scale Slurm clusters.
  • Drive the integration of GPU acceleration in the Slurm environment, including device plugin architecture, GPU operators, accelerator-aware scheduling, and resource allocation.
  • Ensure that high-performance networking technologies, such as InfiniBand and RoCE, are correctly leveraged for distributed GPU workloads running through Slurm.
  • Implement and manage features such as multi-tenancy, cluster lifecycle management, auto-scaling, and high availability for the managed Slurm control plane services.
  • Develop scalable systems to compete with leading managed services.

Other

  • Support the development of your peers by sharing knowledge and providing guidance in technical discussions.
  • Excellent communication skills, both verbal and written.