CoreWeave aims to deliver highly performant, efficient, and resilient AI cloud services by evolving its Kubernetes-native inference platform to meet strict P99 SLAs at scale.
Requirements
- Strong coding skills in Python or Go (C++ a plus) and deep familiarity with networked systems and performance engineering.
- Hands-on experience with Kubernetes at production scale, CI/CD, and observability stacks (Prometheus, Grafana, OpenTelemetry).
- Practical knowledge of inference internals: batching, caching, mixed precision (BF16/FP8), streaming token delivery.
- Proven track record of improving tail latency (P95/P99) and service reliability through metrics-driven work.
- Contributions to inference frameworks (vLLM, Triton, TensorRT-LLM, Ray Serve, TorchServe).
- Experience with CUDA kernels, NCCL/SHARP, RDMA/NUMA, or GPU interconnect topologies.
Responsibilities
- Lead design reviews and drive architecture within the team; decompose multi-service work into clear milestones.
- Define and own SLIs/SLOs; ensure post-incident actions land and reliability improves release-over-release.
- Implement advanced optimizations (e.g., micro-batch schedulers, speculative decoding, KV-cache reuse) and quantify impact.
- Strengthen incident posture: capacity planning, autoscaling policy, graceful degradation, rollback/traffic-shift strategies.
- Mentor IC1/IC2 engineers; review cross-team designs and elevate coding/testing standards.
Other
- ~3–5 years of industry experience building distributed systems or cloud services.
- Experience leading multi-team initiatives or partnering with customers on mission-critical launches.
- Remote work may be considered for candidates located more than 30 miles from an office, depending on the role's requirements for specialized skill sets.
- New hires will be invited to attend onboarding at one of our hubs within their first month.
- Teams also gather quarterly to support collaboration.