Crusoe's mission is to accelerate the abundance of energy and intelligence by crafting the engine that powers a world where people can create ambitiously with AI — without sacrificing scale, speed, or sustainability. The company is seeking to build and operate Slurm as a fully managed cloud service within Crusoe Cloud to deliver next-generation orchestration capabilities for GPU-accelerated and high-performance computing (HPC) at scale.
Requirements
- 7+ years of software engineering experience, including strong systems engineering experience. Experience in distributed systems, cloud, or HPC environments is a must.
- 2+ years of programming experience in Go. Strong proficiency in other languages used for systems and HPC tooling (Rust, C++, Python) is also beneficial.
- Extensive experience with Kubernetes and with Linux engineering and debugging.
- Deep knowledge of Slurm (Simple Linux Utility for Resource Management) administration and the architecture required for managing compute jobs in high-performance environments.
- Skilled in infrastructure as code, ideally with Terraform, and familiar with systems-level challenges.
- Understanding of Argo, CI/CD, and automated testing pipelines. Able to design and take ownership of system architecture, including CI/CD pipelines, while ensuring adherence to security standards.
- Strong knowledge of container networking (CNI plugins, service meshes) and Linux networking fundamentals.
Responsibilities
- Lead the development and engineering of our managed Slurm offering, providing a seamless experience for AI/ML and HPC customers who rely on robust Slurm job scheduling.
- Contribute to the development of scalable and robust software solutions, closely aligning with the strategic objectives outlined in the Crusoe Cloud roadmap.
- Design, build, and maintain Kubernetes operators and controllers dedicated to managing the lifecycle, configuration, and state of large-scale Slurm clusters.
- Drive the integration of GPU acceleration in the Slurm environment, including device plugin architecture, GPU operators, accelerator-aware scheduling, and resource allocation.
- Ensure that high-performance networking technologies, such as InfiniBand and RoCE, are correctly leveraged for distributed GPU workloads running through Slurm.
- Implement and manage features such as multi-tenancy, cluster lifecycle management, auto-scaling, and high availability for the managed Slurm control plane services.
- Develop scalable systems to compete with leading managed services.
Other
- Support the development of your peers by sharing knowledge and providing guidance in technical discussions.
- Excellent communication skills, both verbal and written.