NVIDIA seeks to improve the performance and efficiency of deep learning inference for AI applications by designing, building, and optimizing GPU-accelerated software.
Requirements
- C/C++ programming and software design skills
- Experience with training, deploying, or optimizing inference of DL models in production is a plus
- Modeling, profiling, debugging, and code optimization skills, or CPU and GPU architectural knowledge, are a plus
- GPU programming experience (CUDA, OpenAI Triton, or CUTLASS) is a plus
- Python experience is a plus
- Experience with multi-GPU communication libraries (NCCL, NVSHMEM)
- Familiarity with Agile software development practices is helpful
Responsibilities
- Optimize, analyze, and tune the performance of DL models in domains such as LLMs, multimodal, and generative AI
- Scale the performance of DL models across different NVIDIA accelerator architectures and types
- Contribute features and code to NVIDIA's inference libraries and to LLM software solutions such as vLLM, SGLang, and FlashInfer
- Collaborate with teams across frameworks and NVIDIA libraries on innovative inference optimization solutions
- Implement the latest algorithms for public release in inference frameworks
- Identify and drive performance improvements for state-of-the-art LLM and Generative AI models across NVIDIA accelerators
- Apply open-source tools and plugins, including CUTLASS, OpenAI Triton, NCCL, and CUDA kernels, to implement and optimize model serving pipelines
Other
- Pursuing a Master's or PhD, or equivalent experience, in a relevant field (Computer Engineering, Computer Science, EECS, AI)
- Creative and autonomous engineer with a genuine passion for technology
- Ability to work in a diverse and inclusive work environment
- NVIDIA is committed to fostering a diverse work environment and is an equal opportunity employer
- Base salary is determined based on location, experience, and the pay of employees in similar positions