CoreWeave is seeking to enhance the performance, reliability, and scalability of its GPU performance testing platform to ensure a reliable and performant experience for its customers in AI infrastructure.
Requirements
- ~3–5 years experience
- Proficiency in Go and/or Python software development.
- Hands-on experience with Kubernetes at production scale.
- Experience testing hardware at scale
- HPC Experience
- Experience with AI/ML infrastructure and training / inference
Responsibilities
- Design and implement solutions to problems of scale for testing and validation of CoreWeave’s global infrastructure.
- Build and maintain backend services and APIs (gRPC/REST) in Go or Python to interact with Kubernetes and other infrastructure systems.
- Develop performance tests and automation workflows to expand hardware validation across the CoreWeave fleet.
- Write and maintain Kubernetes custom controllers and operators to automate infrastructure testing.
- Adapt and extend open source tooling to enhance visibility into system metrics, performance, and health.
- Participate in an on-call rotation.
Other
- Must be a U.S. person or eligible to access export controlled information without required export authorization or eligible and reasonably likely to obtain the required export authorization from the applicable U.S. government agency.
- Flexible, hybrid work environment with potential remote work for candidates more than 30 miles from an office.
- New hires must attend onboarding at one of CoreWeave’s hubs within their first month.
- Teams gather quarterly for collaboration.