NVIDIA's nvFuser team is hiring engineers to build the next-generation fusion compiler that automatically optimizes deep learning models for workloads scaling to thousands of GPUs, shaping the future of AI compilation.
Requirements
- CUDA kernel optimization
- Compiler infrastructure
- Systems-level performance work
- Advanced C++ systems programming, including large-codebase development, template meta-programming, and performance-critical code
- Strong parallel programming experience with multi-threading, OpenMP, CUDA, MPI, NCCL, NVSHMEM, or other parallel computing technologies
Responsibilities
- Design algorithms that generate highly optimized code from deep learning programs
- Build GPU-aware CPU runtime systems that coordinate kernel execution for maximum performance
- Master the latest GPU architectures
- Develop innovative techniques for emerging AI workloads
- Debug performance bottlenecks in thousand-GPU distributed systems
- Influence next-generation hardware design
- Push the boundaries of what's possible in AI compilation
Qualifications
- MS or PhD in Computer Science, Computer Engineering, Electrical Engineering, or related field (or equivalent experience).
- Demonstrated experience with low-level performance optimization and systematic bottleneck identification that goes beyond basic profiling.
- Performance analysis skills: experience analyzing high-level programs to identify performance bottlenecks and develop optimization strategies.
- A collaborative problem-solving approach, with adaptability in ambiguous situations, first-principles thinking, and a sense of ownership.
- Excellent verbal and written communication skills.