CoreWeave is building a cloud platform of cutting-edge services to power the next wave of AI, providing enterprises and leading AI labs with the most performant, efficient, and resilient solutions for accelerated computing.
Requirements
- Strong coding skills in Python or Go (C++ a plus) and deep familiarity with networked systems and their performance.
- Hands-on experience with Kubernetes at production scale, CI/CD, and observability stacks (Prometheus, Grafana, OpenTelemetry).
- Practical knowledge of inference internals: batching, caching, mixed precision (BF16/FP8), streaming token delivery.
- Proven track record of improving tail latency (P95/P99) and service reliability through metrics-driven work.
- Contributions to inference frameworks (vLLM, Triton, TensorRT-LLM, Ray Serve, TorchServe).
- Experience with CUDA kernels, NCCL/SHARP, RDMA/NUMA, or GPU interconnect topologies.
Responsibilities
- Lead design reviews and drive architecture within the team; decompose multi-service work into clear milestones.
- Define and own SLIs/SLOs; ensure post-incident actions land and reliability improves release-over-release.
- Implement advanced optimizations (e.g., micro-batch schedulers, speculative decoding, KV-cache reuse) and quantify impact.
- Strengthen incident posture: capacity planning, autoscaling policy, graceful degradation, rollback/traffic-shift strategies.
- Mentor IC1/IC2 engineers; review cross-team designs and elevate coding/testing standards.
- Own an area spanning multiple services and teams (e.g., request routing & adaptive scheduling, cost-per-token analytics, GPU resource isolation).
- Partner with product, orchestration, and hardware teams to evolve our Kubernetes-native inference platform and meet strict P99 SLAs at scale.
Other
- ~5–8 years of industry experience building distributed systems or cloud services.
- Experience leading multi-team initiatives or partnering with customers on mission-critical launches.
- CoreWeave is an equal opportunity employer, committed to fostering an inclusive and supportive workplace.
- This position requires access to export-controlled information.
- The base salary range for this role is $165,000 to $242,000.