NVIDIA is looking to develop pioneering compute software solutions for critically important environments, including leading academic institutions, start-ups, and industry, by leading and managing communication libraries such as NCCL, NVSHMEM, and UCX for Deep Learning and HPC.
Requirements
- Hands-on experience with software development for hardware platforms, communication runtimes, or high-performance networking, with demonstrated success in delivering these complex products to customers.
- Proficiency in Agile software development methodologies.
- Comprehensive understanding of software engineering principles, including experience with widely adopted configuration management tools, productivity-enhancing tools, and automation processes.
- Background with parallel programming models (MPI, SHMEM) and at least one communication runtime (MPI, NCCL, NVSHMEM, OpenSHMEM, UCX, UCC).
- Knowledge of a modern programming language is desired, as well as depth in HPC and ML/DL fundamentals.
- Background with RDMA, high-performance networking technologies (InfiniBand, RoCE, Ethernet, EFA), network architecture and network topologies.
- Solid understanding of the Deep Learning framework ecosystem for training and inference.
Responsibilities
- Lead status meetings, proactively address challenges and customer concerns, and serve as the primary point of contact (POC) for building and upholding prioritized release schedules and plans.
- Strategically plan and partner across NVIDIA teams to drive software objectives while maintaining schedules and formulating mitigation strategies for risks identified across multiple parallel work streams.
- Lead existing product development enhancements and software release processes, while collaborating with engineering management to optimize the development workflow and efficiency.
- Translate customer requirements into actionable internal milestones and tasks, ensuring customers are continually informed of issue status.
- Drive virtual reviews and establish continuous feedback loops by communicating benchmarking results and customer insights to product and engineering leadership.
- Track and report large-scale performance benchmarking across all clusters. Build performance dashboards and reporting processes to monitor KPIs and surface performance trends.
- Collaborate with internal teams and third-party partners across time zones, as necessary, to resolve customer issues and oversee customer releases.
Other
- BS, MS, or Ph.D. in CS, CE, EE, or a related technical field, or equivalent experience.
- 12+ years of overall experience in the software industry with specialization in HPC networking or system software.
- 6+ years of program management experience in a similar or related role.
- Strong communication and technical presentation skills, and the ability to work independently and actively with minimal guidance.
- Previous experience coordinating activities between HW and SW organizations.