Unlocking greater AI capability while dramatically improving efficiency at the infrastructure layer of LLM inference systems.
Requirements
Prior experience contributing to core LLM inference infrastructure (e.g., vLLM, SGLang, TensorRT).
Prior experience in accelerator programming (e.g., CUDA, JAX/Pallas, ROCm).
Advanced computer architecture and performance engineering skills are a big plus.
Responsibilities
Prototype and optimize emerging ML inference systems.
Develop novel memory models for expandable VRAM.
Write efficient GPU kernels for data movement.
Perform design-space exploration, implementation, and benchmarking of inference engines, both in simulation and on real hardware.
Other
This role is part engineering, part research.
This role will be performed on-site from one of our offices in Santa Clara, CA or Boston, MA.
Relocation assistance and visa sponsorship are available.
A collaborative, continuous-learning work environment with smart, dedicated colleagues developing the next generation of high-performance computing architecture.
We value thoughtful disagreement, fast learning, and intellectual fearlessness.