NVIDIA is looking to expand the capabilities of Megatron Core and NeMo Framework so that users can develop, train, and optimize Large Language Models (LLMs) and Multimodal (MM) foundation models.
Requirements
- Experience with AI frameworks (e.g., PyTorch, JAX) and/or inference and deployment environments (e.g., TensorRT-LLM, vLLM, SGLang)
- Proficiency in Python programming, software design, debugging, performance analysis, test design, and documentation
- Strong understanding of AI/Deep-Learning fundamentals and their practical applications
- Hands-on experience in large-scale AI training, with a deep understanding of core compute system concepts (such as latency/throughput bottlenecks, pipelining, and multiprocessing) and demonstrated excellence in related performance analysis and tuning
- Expertise in distributed computing, model parallelism, and mixed precision training
- Prior experience with Generative AI techniques applied to LLM and multimodal learning (text, image, and video)
- Knowledge of GPU/CPU architecture and related numerical software
Responsibilities
- Design and develop the GenAI open source Megatron Core and NeMo Framework
- Solve large-scale, end-to-end AI training and inference challenges, spanning the full model lifecycle from initial orchestration, data pre-processing, and running of model training and tuning, to model deployment
- Work at the intersection of AI applications, libraries, frameworks, and the entire software stack
- Innovate and improve model architectures, distributed training algorithms, and model parallel paradigms
- Accelerate foundation model training and fine-tuning with mixed precision recipes and next-gen NVIDIA GPU architectures
- Tune and optimize the performance of deep learning frameworks and software components
- Research, prototype, and develop robust and scalable AI tools and pipelines
Other
- MS, PhD, or equivalent experience in Computer Science, AI, Applied Math, or a related field, and 5+ years of industry experience
- Consistent record of working effectively across multiple engineering initiatives and improving AI libraries through innovation
- Base salary range is 148,000 USD - 235,750 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4
- Eligible for equity and benefits
- Applications for this job will be accepted at least until October 3, 2025