Job Board
LogoLogo

Get Jobs Tailored to Your Resume

Filtr uses AI to scan 1000+ jobs and finds postings that perfectly matches your resume

NVIDIA Logo

Senior Software Engineer, Profiling Services

NVIDIA

$184,000 - $356,500
Oct 30, 2025
Santa Clara, CA, US
Apply Now

Innovate GPU performance analysis for Machine Learning workloads by designing, implementing, and leading an Always-On Profiling service.

Requirements

  • 8+ years of meaningful software development experience in C, C++, and Python
  • 10+ years in system software design, operating systems fundamentals, computer architectures, performance analysis, and delivering production-quality software.
  • Profiling & Performance Tools Expert: Extensive knowledge of profiling technologies (sampling, tracing), overhead analysis, and diverse profiling data (CPU/GPU events, performance counters, API traces, event correlation). Familiarity with existing profiling ecosystems and their limitations is a plus.
  • GPU & CUDA Proficiency: In-depth knowledge of CUDA APIs, runtime, streams, kernels, and GPU architecture.
  • ML Ecosystem & Performance Analysis: Familiarity with ML frameworks such as PyTorch and JAX, and knowledge of performance analysis for AI training/inference applications.
  • Large-Scale System Development & Debugging: Experience developing and debugging across complex multi-layered software systems, including user mode and kernel drivers, with a proven ability to contribute to and extend substantial codebases (100s of millions of lines).
  • Proficiency in Designing APIs and Interfaces for Profiling Tools: Designs robust, flexible APIs and interfaces enabling seamless integration of profiling tools with various frameworks and custom code.

Responsibilities

  • Architect and Build Scalable Systems: Drive the design and implementation of the AON profiling service's core systems. You'll master inter-process communication (IPC), memory management, and building low-overhead architectures to handle profiling data from complex multi-node, multi-process, multi-GPU, and cluster environments.
  • Elevate Software Engineering Excellence: Promote high standards in software development, including design patterns, concurrency, parallelism, and advanced debugging for asynchronous systems. Our commitment to code quality and robust testing ensures a reliable profiling service.
  • Lead, Mentor, and Innovate: Guide and mentor engineers, provides impactful code reviews, and shape technical roadmaps. Proactively identify complex technical issues within the AON project, break them down, and craft innovative solutions. Your problem-solving prowess will be crucial for AON's success with ML workloads.
  • Architect and Build High-Performance Platforms: Transform user needs into clear requirements and design documents. Explore diverse approaches to problems, making well-reasoned recommendations. Lead end-to-end feature development—from planning and prototyping to implementation, testing, and customer evaluation. This involves hands-on development across user applications, drivers, performance counter libraries, and lower-level platform/hardware abstraction layers.
  • Collaborate Across Boundaries: Partner effectively with diverse internal and external teams. Exceptional communication and collaboration skills are key to integrating AON seamlessly into the broader profiling and ML ecosystem.
  • Designs robust, flexible APIs and interfaces enabling seamless integration of profiling tools with various frameworks and custom code.
  • A history of breaking down ill-defined problems in complex technical domains, designing effective solutions, and leading teams to implement them.

Other

  • Lead, Mentor, and Innovate: Guide and mentor engineers, provides impactful code reviews, and shape technical roadmaps.
  • Strong interpersonal, verbal, and written communication, demonstrating the ability to build cross-organizational partnerships and lead technical teams through complex challenges.
  • Collaborate Across Boundaries: Partner effectively with diverse internal and external teams. Exceptional communication and collaboration skills are key to integrating AON seamlessly into the broader profiling and ML ecosystem.
  • Translating Customer Needs: Skilled at redefining customer requests into actionable use cases and requirements.
  • Strong understanding of system security principles.