CoreWeave aims to provide a cloud platform of cutting-edge services that power the next wave of AI. To do so, it is advancing its orchestration platform, including SUNK (Slurm on Kubernetes) and beyond, so that workloads run seamlessly, reliably, and efficiently across massive GPU clusters.
Requirements
- 8+ years of professional software engineering experience
- Proven track record designing and operating large-scale distributed systems in production
- Deep expertise in Slurm/Kubernetes internals and cloud-native development
- Advanced proficiency in Go and distributed systems design
- Experience setting technical direction and influencing cross-team architecture
- Familiarity with orchestration and workflow technologies such as Ray, Kubeflow, Kueue, Istio, Knative, or Argo Workflows
- Experience with distributed workloads, GPU-based applications, or ML pipelines
Responsibilities
- Define architectural direction for CoreWeave's orchestration platform
- Own critical parts of the orchestration platform and other managed services
- Drive cross-org initiatives in hyperscale scheduling, quota enforcement, and scaling
- Mentor senior engineers
- Establish org-wide best practices in reliability and observability
- Ensure CoreWeave's orchestration layer evolves to meet the demands of next-generation AI workloads
- Build the systems that eliminate infrastructure bottlenecks and create new orchestration capabilities
Other
- Bachelor's degree or higher in a relevant field
- Ability to work in a hybrid work environment, with remote work considered for candidates located more than 30 miles from an office
- Must be a U.S. person, defined as a U.S. citizen or national, U.S. lawful permanent resident, refugee, or asylee
- Eligible to access export-controlled information without a required export authorization