NVIDIA is looking for an experienced Senior Software Technical Program Manager to lead the development of pioneering compute software solutions for critically important environments. These solutions impact a wide range of fields and are used across leading academic institutions, start-ups, and industry. The role will focus on leading and managing communication libraries such as NCCL, NVSHMEM, and UCX for Deep Learning and HPC.
Requirements
- Hands-on experience with software development for hardware platforms, communication runtimes, or high-performance networking, with demonstrated success in delivering these complex products to customers.
- Proficiency in Agile software development methodologies.
- Comprehensive understanding of software engineering principles, including experience with widely-adopted configuration management tools and productivity-enhancing tools and automation processes.
- Background with parallel programming models (MPI, SHMEM) and at least one communication runtime (MPI, NCCL, NVSHMEM, OpenSHMEM, UCX, UCC).
- Knowledge of a modern programming language is desired, as well as depth in HPC and ML/DL fundamentals.
- Background with RDMA, high-performance networking technologies (InfiniBand, RoCE, Ethernet, EFA), network architecture, and network topologies.
Responsibilities
- Lead status meetings, proactively address challenges and customer concerns, and serve as the primary point of contact (POC) for building and upholding prioritized release schedules and plans.
- Strategically plan and partner across NVIDIA teams to drive software objectives while maintaining schedules and formulating mitigation strategies for risks identified across multiple parallel work streams.
- Lead existing product development enhancements and software release processes, while collaborating with engineering management to optimize the development workflow and efficiency.
- Translate customer requirements into actionable internal milestones and tasks, ensuring customers are kept informed of issue status.
- Drive virtual reviews and establish continuous feedback loops by communicating benchmarking results and customer insights to product and engineering leadership.
- Track and report large-scale performance benchmarking across all clusters. Build performance dashboards and reporting processes to monitor KPIs and surface performance trends.
- Collaborate across internal teams and third-party partners across time zones, as necessary, to resolve customer issues and oversee customer releases.
Other
- 12+ years of overall experience in the software industry, with specialization in HPC networking or system software.
- 6+ years of program management experience in a similar or related role.
- Proven ability to creatively resolve technical and resource issues, and to think strategically and tactically while building consensus to ensure program success.
- Exceptional attention to detail and a demonstrated capacity for multitasking in a dynamic environment with shifting priorities and changing requirements.
- Strong communication and technical presentation skills, and the ability to work independently and proactively with minimal guidance.
- Previous experience coordinating activities between HW and SW organizations.
- Solid understanding of the deep learning framework ecosystem for training and inference.
- Solid understanding of operating systems, datacenter servers, and graphics principles and standards.