The company is looking to take its frontier AI models from the lab into production-ready services by building high-performance inference infrastructure.
Requirements
- Strong experience in distributed systems and low-latency ML serving
- Skilled with performance optimization tools and techniques, with experience delivering significant performance gains
- Hands-on with vLLM, SGLang, or equivalent frameworks
- Familiarity with GPU optimization, CUDA, and model parallelism
Responsibilities
- Architect and optimize high-performance inference infrastructure for large foundation models
- Benchmark and improve latency, throughput, and agent responsiveness
- Work with researchers to deploy new model architectures and multi-step agent behaviors
- Implement caching, batching, and request prioritization to handle high request volumes
- Build monitoring and observability into inference pipelines
Other
- Full-time, onsite role in Menlo Park
- Startup hours apply
- Comfort working in a high-velocity, often ambiguous startup environment