Crusoe's mission is to accelerate the abundance of energy and intelligence by crafting the engine that powers a world where people can create ambitiously with AI — without sacrificing scale, speed, or sustainability. This role will help design, build, and operate Crusoe’s next-generation Kubernetes platform across global datacenters, ensuring performance, reliability, and automation at every layer of the stack.
Requirements
- Strong experience running Kubernetes on bare metal (not just managed services)
- Expert-level knowledge of Linux internals (cgroups, namespaces, kernel networking)
- Deep experience with CNIs (Cilium, Calico), load balancers (Envoy, HAProxy, F5), and L3 networking (BGP, ECMP)
- Proven track record provisioning and operating physical servers at scale (PXE/iPXE, Tinkerbell, MAAS, BMC/IPMI automation)
- Strong programming skills in Go for building operators, controllers, and automation tooling
- Hands-on experience with distributed storage systems (Ceph, MinIO, Rook, CSI drivers)
- Strong background in observability (Prometheus, Alertmanager, metrics-based autoscaling, logging/ELK)
Responsibilities
- Designing, building, and operating Kubernetes clusters on bare metal at scale
- Engineering full cluster lifecycle management (Talos bootstrapping, upgrades, node reprovisioning, HA control planes, recovery workflows)
- Architecting networking, load balancing, and service mesh solutions optimized for bare metal
- Implementing performant CNIs (Calico, Cilium), integrating L2/L3 networking, routing (BGP/ECMP), and optimizing traffic across racks and datacenters
- Automating provisioning via PXE/iPXE, Tinkerbell, and MAAS; managing BMCs (IPMI/Redfish); and standardizing BIOS/firmware across heterogeneous hardware fleets
- Designing and operating persistent storage (local disks, block, object), including Ceph, Rook, and OpenEBS
- Building automation and tooling (Go, Python, Bash) for provisioning, drift detection, upgrades, and incident response
Other
- 10+ years in infrastructure engineering, including 3+ years operating Kubernetes in production
- Familiarity with PKI, identity, and secrets management (Vault, cert-manager)
- Excellent debugging skills for complex distributed systems
- Strong communication and collaboration across cross-functional teams
- Experience with hardware fleet management across multiple datacenters