NVIDIA seeks to improve the performance and efficiency of deep learning inference for AI applications by designing, building, and optimizing GPU-accelerated software.
Requirements
- Excellent C/C++ programming and software design skills
- Python experience is a plus
- Prior experience training, deploying, or optimizing the inference of DL models in production is a plus
- Prior background in performance modeling, profiling, debugging, and code optimization, or architectural knowledge of CPUs and GPUs, is a plus
- GPU programming experience (CUDA, OpenAI Triton, or CUTLASS) is a plus
- Experience with multi-GPU communication libraries (NCCL, NVSHMEM) is a plus
- Experience with deep learning frameworks and inference engines such as PyTorch, vLLM, and SGLang is a plus
Responsibilities
- Optimize, analyze, and tune the performance of DL models across domains such as LLM, multimodal, and Generative AI
- Scale the performance of DL models across different NVIDIA accelerator architectures and types
- Contribute features and code to NVIDIA's inference libraries and to open-source LLM software such as vLLM, SGLang, and FlashInfer
- Work with cross-functional teams across frameworks, NVIDIA libraries, and inference optimization to deliver innovative solutions
- Implement the latest algorithms for public release in frameworks like SGLang and vLLM
- Identify and drive performance improvements for state-of-the-art LLM and Generative AI models across NVIDIA accelerators
- Implement and optimize model serving pipelines using open-source tools and plugins
Other
- Master's or PhD, or equivalent experience, in a relevant field (Computer Engineering, Computer Science, EECS, AI)
- 5+ years of relevant software development experience
- Agile software development skills are helpful