Crusoe is building the world’s favorite AI-first cloud infrastructure company and needs to develop its next-generation orchestration platform to power GPU-accelerated and high-performance computing at scale.
Requirements
- 8+ years of software engineering experience in distributed systems, cloud, or HPC.
- Proven track record of technical leadership and driving architecture in production systems.
- Deep expertise in Kubernetes internals (control plane, operators, API machinery, scheduling).
- Strong proficiency in Go (preferred) or another systems language (Rust, C++), or in Python for HPC tooling.
- Extensive experience with GPU integration in Kubernetes (device plugins, GPU operators, resource allocation).
- Strong knowledge of container networking (Cilium, Calico, Multus, service meshes) and Linux networking fundamentals.
- Familiarity with high-performance networking technologies (InfiniBand, RoCE) and accelerator-aware scheduling.
Responsibilities
- Lead architecture and design for core features of Crusoe’s Managed Kubernetes platform (multi-tenancy, control plane scalability, cluster lifecycle, and high availability).
- Drive integration of GPU acceleration in Kubernetes, including device plugin architecture, GPU operators, scheduling, autoscaling, and monitoring.
- Guide development of advanced container networking capabilities, including CNI plugins, network operators, service meshes, and high-performance fabrics (InfiniBand, RoCE).
- Define and enforce best practices for security, multi-cluster deployments, and workload isolation across compute, GPU, and networking layers.
- Partner with product and engineering leadership to set long-term technical strategy and roadmap for Crusoe Managed Kubernetes (CMK).
- Mentor engineers across the organization, providing technical guidance and elevating standards for design, code quality, and operational excellence.
- Troubleshoot and resolve complex distributed systems challenges spanning compute, networking, and GPU acceleration.
Other
- Ability to influence cross-functional teams to deliver reliable, scalable, and secure orchestration for mission-critical workloads.
- Contribute to and represent Crusoe in open-source communities (Kubernetes SIGs, CNCF projects, GPU and networking ecosystem).
- Familiarity with both NVIDIA and AMD GPU stacks (CUDA, ROCm, NCCL).
- Experience with Slurm, MPI, Ray, or distributed ML frameworks (TensorFlow, PyTorch, JAX).
- Contributions to open-source projects in the Kubernetes, GPU, or networking ecosystems.