Together AI is building the Inference Platform to bring advanced generative AI models to the world, powering multi-tenant serverless workloads and dedicated endpoints. The challenge is to minimize latency, fully utilize GPU hardware, and serve diverse workloads efficiently at scale.
Requirements
- 5+ years of demonstrated experience building large-scale, fault-tolerant distributed systems and API microservices.
- Strong background in designing, analyzing, and improving efficiency, scalability, and stability of complex systems.
- Excellent understanding of low-level OS concepts: multi-threading, memory management, networking, and storage performance.
- Expert-level programming in one or more of: Rust, Go, Python, or TypeScript.
- Knowledge of modern LLMs and generative models, and of how they are served in production, is a plus.
- Experience with the open-source ecosystem around inference is highly valuable; familiarity with SGLang, vLLM, or NVIDIA Dynamo is especially useful.
- Experience with Kubernetes or container orchestration is a strong plus.
Responsibilities
- Build and optimize global and local request routing, ensuring low-latency load balancing across data centers and model engine pods (a minimal routing sketch follows this list).
- Develop auto-scaling systems to dynamically allocate resources and meet strict SLOs across dozens of data centers (see the scaling sketch after the list).
- Design systems for multi-tenant traffic shaping, tuning both resource allocation and request handling, including smart rate limiting and regulation, to ensure fairness and a consistent experience across all users (a rate-limiting sketch follows the list).
- Engineer trade-offs between latency and throughput to serve diverse workloads efficiently.
- Optimize prefix caching to reduce redundant model compute and speed up responses (see the prefix-cache sketch after the list).
- Collaborate with ML researchers to bring new model architectures into production at scale.
- Continuously profile and analyze system-level performance to identify bottlenecks and implement optimizations.
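The sketches below illustrate, in simplified form, a few of the building blocks named above. All class names, parameters, and thresholds are illustrative assumptions, not Together AI's actual implementation. First, a minimal Python sketch of least-loaded routing across model engine pods, scoring each pod by its in-flight request count weighted by recent latency:

```python
# Minimal sketch of least-loaded request routing across model engine pods.
# Pod and Router are illustrative names, not Together AI internals.
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    in_flight: int = 0             # requests currently being served
    ewma_latency_ms: float = 50.0  # smoothed observed latency

class Router:
    def __init__(self, pods):
        self.pods = list(pods)

    def pick(self) -> Pod:
        # Score each pod by expected wait: queue depth times recent latency.
        pod = min(self.pods, key=lambda p: (p.in_flight + 1) * p.ewma_latency_ms)
        pod.in_flight += 1
        return pod

    def complete(self, pod: Pod, latency_ms: float, alpha: float = 0.2):
        # On completion, update the pod's exponentially weighted latency.
        pod.in_flight -= 1
        pod.ewma_latency_ms = (1 - alpha) * pod.ewma_latency_ms + alpha * latency_ms
```

The same idea extends hierarchically: a global tier picks a data center, and a local tier picks a pod within it.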
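A minimal sketch of an SLO-driven scaling decision, assuming p95 latency and per-replica queue depth as the input signals; the targets and limits are placeholders:

```python
# Minimal sketch of an SLO-driven replica-count decision.
# Thresholds and signals are illustrative placeholders.
import math

def desired_replicas(current: int, p95_latency_ms: float, slo_ms: float,
                     queue_per_replica: float, target_queue: float = 4.0,
                     max_replicas: int = 64) -> int:
    # Scale on whichever signal is furthest from its target:
    # observed tail latency vs. the SLO, or queue depth vs. the target depth.
    pressure = max(p95_latency_ms / slo_ms, queue_per_replica / target_queue)
    desired = math.ceil(current * pressure)
    return max(1, min(max_replicas, desired))
```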
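A minimal per-tenant token-bucket sketch for the kind of smart rate limiting used in multi-tenant traffic shaping; the rates and burst sizes are placeholder values:

```python
# Minimal per-tenant token-bucket rate limiter; rates and burst sizes are placeholders.
import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s       # tokens refilled per second
        self.capacity = burst        # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# One bucket per tenant so a noisy neighbor cannot starve everyone else.
buckets = defaultdict(lambda: TokenBucket(rate_per_s=100, burst=200))

def admit(tenant_id: str, request_cost: float = 1.0) -> bool:
    return buckets[tenant_id].allow(request_cost)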
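A minimal sketch of block-granular prefix matching, the idea behind prefix caching: reuse cached KV state for the longest previously seen token prefix. The block size and data structures are illustrative simplifications, not how any specific engine (e.g. vLLM or SGLang) implements it:

```python
# Minimal sketch of block-granular prefix matching for prefix caching.
# BLOCK size, hashing, and handles are illustrative simplifications.
BLOCK = 16  # tokens per cached block

class PrefixCache:
    def __init__(self):
        self.blocks = {}  # prefix hash -> handle to cached KV blocks

    def match(self, tokens: list[int]) -> int:
        """Return how many leading tokens already have cached KV state."""
        matched = 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            key = hash(tuple(tokens[:end]))
            if key not in self.blocks:
                break
            matched = end
        return matched

    def insert(self, tokens: list[int], handles: list[object]):
        # Register one handle per full block of the prompt's token prefix.
        for i, end in enumerate(range(BLOCK, len(tokens) + 1, BLOCK)):
            self.blocks.setdefault(hash(tuple(tokens[:end])), handles[i])
```

Real engines also manage eviction and cross-request sharing of the underlying KV memory; the dictionary here stands in for that machinery.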
Other
- Shape the core inference backbone that powers Together AI’s frontier models.
- Solve performance-critical challenges in global request routing, load balancing, and large-scale resource allocation.
- Work with state-of-the-art accelerators (H100s, H200s, GB200s) at global scale.
- Partner with world-class researchers to bring new model architectures into production.
- Collaborate with and contribute to the open source community, shaping the tools that advance the industry.