With the Dynamo platform, NVIDIA is tackling the problem of efficient, scalable inference for large language and reasoning models in distributed GPU environments.
Requirements
- Strong proficiency in systems programming (Rust and/or C++), with experience in Python for workflow and API development.
- Experience with Go for developing Kubernetes controllers and operators.
- Deep understanding of distributed systems, parallel computing, and GPU architectures.
- Experience with cloud-native deployment and container orchestration (Kubernetes, Docker).
- Experience with large-scale inference serving, LLMs, or similar high-performance AI workloads.
- Background with memory management, data transfer optimization, and multi-node orchestration.
- Familiarity with open-source development workflows (GitHub, continuous integration and continuous deployment).
Responsibilities
- Dynamo Kubernetes Serving Platform: Build the Kubernetes deployment and workload management stack for Dynamo to facilitate inference deployments at scale. Identify bottlenecks and apply optimization techniques to fully utilize hardware capacity.
- Scalability & Reliability: Develop robust, production-grade inference workload management systems that scale from a handful to thousands of GPUs, supporting a variety of LLM frameworks (e.g., TensorRT-LLM, vLLM, SGLang).
- Disaggregated Serving: Architect and optimize the separation of the prefill (context ingestion) and decode (token generation) phases across distinct GPU clusters to improve throughput and resource utilization (see the first sketch after this list). Contribute to embedding disaggregation for multi-modal models (vision-language, audio-language, and video-language models).
- Dynamic GPU Scheduling: Develop and refine Planner algorithms for real-time allocation and rebalancing of GPU resources in response to fluctuating workloads and system bottlenecks, ensuring peak performance at scale (see the second sketch after this list).
- Intelligent Routing: Enhance the smart routing system to efficiently direct inference requests to GPU worker replicas that already hold the relevant KV cache data, minimizing re-computation and latency for sophisticated, multi-step reasoning tasks (see the third sketch after this list).
- Distributed KV Cache Management: Innovate in the management and transfer of large KV caches across heterogeneous memory and storage hierarchies, using the NVIDIA Inference Xfer Library (NIXL) for low-latency, cost-effective data movement (see the fourth sketch after this list).
- Collaborate on the design and development of the Dynamo Kubernetes stack.
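To make the disaggregated serving item concrete, here is a minimal, self-contained sketch of the prefill/decode split. The worker classes, method names, and placeholder tokens are illustrative assumptions, not the Dynamo API:

```python
# A minimal sketch of disaggregated serving; illustrative only, not the Dynamo API.
from dataclasses import dataclass, field


@dataclass
class PrefillWorker:
    """Compute-bound worker: ingests the prompt and builds the KV cache."""
    name: str
    caches: dict = field(default_factory=dict)

    def prefill(self, request_id: str, prompt_tokens: list[int]) -> str:
        # One batched forward pass over the whole prompt produces the KV
        # cache; return a handle so a decode worker can fetch it remotely.
        self.caches[request_id] = list(prompt_tokens)  # stand-in for real KV tensors
        return f"{self.name}/{request_id}"


@dataclass
class DecodeWorker:
    """Bandwidth-bound worker: generates tokens one step at a time."""
    name: str

    def decode(self, kv_handle: str, max_new_tokens: int) -> list[int]:
        # In a real system the KV cache is pulled over the interconnect
        # before autoregressive generation starts.
        return [0] * max_new_tokens  # placeholder tokens


def serve(prompt_tokens: list[int], prefill: PrefillWorker, decode: DecodeWorker) -> list[int]:
    kv_handle = prefill.prefill("req-1", prompt_tokens)  # phase 1: context ingestion
    return decode.decode(kv_handle, max_new_tokens=16)   # phase 2: token generation


tokens = serve([101, 2009, 2003], PrefillWorker("prefill-0"), DecodeWorker("decode-0"))
```

Splitting the phases lets each pool be sized and batched for its own bottleneck: prefill is compute-bound, decode is memory-bandwidth-bound.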
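For the dynamic GPU scheduling item, here is a toy rebalancing heuristic in the spirit of a Planner: shift GPU replicas toward whichever phase is currently the bottleneck. The metrics and thresholds are assumptions for illustration:

```python
# A toy Planner-style rebalancing heuristic; thresholds are illustrative.
def plan(prefill_queue: int, decode_queue: int,
         prefill_gpus: int, decode_gpus: int) -> tuple[int, int]:
    # Normalize pressure as requests waiting per GPU in each pool.
    prefill_pressure = prefill_queue / max(prefill_gpus, 1)
    decode_pressure = decode_queue / max(decode_gpus, 1)
    # Move one replica at a time toward the more congested pool,
    # keeping at least one GPU in each phase.
    if prefill_pressure > 2 * decode_pressure and decode_gpus > 1:
        return prefill_gpus + 1, decode_gpus - 1
    if decode_pressure > 2 * prefill_pressure and prefill_gpus > 1:
        return prefill_gpus - 1, decode_gpus + 1
    return prefill_gpus, decode_gpus


# Prefill is backed up, so one GPU is shifted from decode to prefill.
assert plan(prefill_queue=40, decode_queue=5, prefill_gpus=2, decode_gpus=6) == (3, 5)
```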
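For the intelligent routing item, here is a sketch of KV-cache-aware routing by longest cached prefix. The worker-advertised prefix sets are a hypothetical representation, not Dynamo's actual router state:

```python
# A sketch of KV-cache-aware routing: send the request to the worker
# that already caches the longest prefix of the prompt's token blocks,
# so the fewest tokens have to be recomputed. Hypothetical data model.
def route(prompt_blocks: list[int], workers: dict[str, set[tuple[int, ...]]]) -> str:
    def cached_prefix_len(cached: set[tuple[int, ...]]) -> int:
        # Count how many leading blocks of the prompt this worker already holds.
        n = 0
        while n < len(prompt_blocks) and tuple(prompt_blocks[: n + 1]) in cached:
            n += 1
        return n

    # Pick the worker with the longest matching prefix; ties could be
    # broken by load, which is omitted here for brevity.
    return max(workers, key=lambda w: cached_prefix_len(workers[w]))


workers = {
    "worker-a": {(7,), (7, 3)},  # has the first two blocks of this prompt cached
    "worker-b": {(9,)},          # cache from an unrelated prompt
}
assert route([7, 3, 5], workers) == "worker-a"
```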
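For the distributed KV cache item, here is a sketch of tiered cache placement across a memory hierarchy with LRU demotion. The tier names, capacities, and eviction policy are toy assumptions; in Dynamo the actual copies between tiers would go through NIXL:

```python
# A sketch of tiered KV-cache placement (GPU HBM -> host memory -> storage)
# with LRU demotion between tiers. Capacities and policy are toy assumptions.
from collections import OrderedDict

TIERS = ["hbm", "host", "disk"]                 # fastest to slowest
CAPACITY = {"hbm": 2, "host": 4, "disk": 16}    # cache entries per tier (toy numbers)


class TieredKVCache:
    def __init__(self):
        # LRU order per tier: oldest entries are demoted first.
        self.tiers = {t: OrderedDict() for t in TIERS}

    def put(self, key: str, blocks: bytes, tier: str = "hbm") -> None:
        self.tiers[tier][key] = blocks
        self.tiers[tier].move_to_end(key)
        # Demote the least-recently-used entry when a tier overflows;
        # in practice the inter-tier copy is the transfer library's job.
        idx = TIERS.index(tier)
        if len(self.tiers[tier]) > CAPACITY[tier] and idx + 1 < len(TIERS):
            old_key, old_blocks = self.tiers[tier].popitem(last=False)
            self.put(old_key, old_blocks, TIERS[idx + 1])

    def get(self, key: str) -> bytes | None:
        for t in TIERS:
            if key in self.tiers[t]:
                blocks = self.tiers[t].pop(key)
                self.put(key, blocks)  # promote back to HBM on reuse
                return blocks
        return None
```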
Other
- BS/MS or higher in Computer Engineering, Computer Science, or a related engineering field (or equivalent experience).
- 15+ years of proven experience in a related field.
- Excellent problem-solving and communication skills.
- Collaborate with the community to address issues, capture feedback, and evolve the framework’s APIs and architecture.
- Write clear documentation and contribute to user and developer guides.