CoreWeave is seeking to solve the problem of delivering a reliable and performant GPU infrastructure experience for AI development by evolving its GPU performance testing platform and ensuring high standards of latency, throughput, and reliability across global infrastructure.
Requirements
- ~5-8 years experience
- Strong proficiency in Go and/or Python software development.
- Hands-on experience with Kubernetes at production scale.
- Experience testing hardware at scale
- HPC Experience
- Experience with AI/ML infrastructure and training / inference
Responsibilities
- Design and implement solutions to problems of scale for testing and validation of CoreWeave’s global infrastructure.
- Build and maintain backend services and APIs (gRPC/REST) in Go or Python to interact with Kubernetes and other infrastructure systems.
- Develop performance tests and automation workflows to expand hardware validation across the CoreWeave fleet.
- Write and maintain Kubernetes custom controllers and operators to automate infrastructure testing.
- Adapt and extend open source tooling to enhance visibility into system metrics, performance, and health.
- Lead and collaborate on cross-team efforts related to infrastructure testing and provisioning.
- Mentor GPU performance team members.
Other
- Participate in an on-call rotation.
- Shape and influence development and deployment policies for the GPU performance team.
- Hybrid work environment with potential for remote work for candidates located more than 30 miles from an office.
- Must be a U.S. person or eligible to access export controlled information without requiring export authorization.
- Ability to attend onboarding at one of CoreWeave’s hubs within the first month.