Advance GPU performance analysis for machine learning workloads by designing, implementing, and leading the development of an Always-On (AON) Profiling service.
Requirements
- 8+ years of meaningful software development experience in C, C++, and Python.
- 10+ years in system software design, operating systems fundamentals, computer architectures, performance analysis, and delivering production-quality software.
- Profiling & Performance Tools Expert: Extensive knowledge of profiling technologies (sampling, tracing), overhead analysis, and diverse profiling data (CPU/GPU events, performance counters, API traces, event correlation). Familiarity with existing profiling ecosystems and their limitations is a plus.
- GPU & CUDA Proficiency: In-depth knowledge of CUDA APIs, runtime, streams, kernels, and GPU architecture.
- ML Ecosystem & Performance Analysis: Familiarity with ML frameworks such as PyTorch and JAX, and knowledge of performance analysis for AI training/inference applications.
- Large-Scale System Development & Debugging: Experience developing and debugging across complex multi-layered software systems, including user-mode and kernel-mode drivers, with a proven ability to contribute to and extend substantial codebases (hundreds of millions of lines).
- Proficiency in Designing APIs and Interfaces for Profiling Tools: Ability to design robust, flexible APIs and interfaces that enable seamless integration of profiling tools with various frameworks and custom code.
Responsibilities
- Architect and Build Scalable Systems: Drive the design and implementation of the AON profiling service's core systems. You'll master inter-process communication (IPC), memory management, and building low-overhead architectures to handle profiling data from complex multi-node, multi-process, multi-GPU, and cluster environments.
- Elevate Software Engineering Excellence: Promote high standards in software development, including design patterns, concurrency, parallelism, and advanced debugging for asynchronous systems. Our commitment to code quality and robust testing ensures a reliable profiling service.
- Lead, Mentor, and Innovate: Guide and mentor engineers, provide impactful code reviews, and shape technical roadmaps. Proactively identify complex technical issues within the AON project, break them down, and craft innovative solutions. Your problem-solving skills will be crucial to AON's success with ML workloads.
- Architect and Build High-Performance Platforms: Transform user needs into clear requirements and design documents. Explore diverse approaches to problems, making well-reasoned recommendations. Lead end-to-end feature development—from planning and prototyping to implementation, testing, and customer evaluation. This involves hands-on development across user applications, drivers, performance counter libraries, and lower-level platform/hardware abstraction layers.
- Collaborate Across Boundaries: Partner effectively with diverse internal and external teams. Exceptional communication and collaboration skills are key to integrating AON seamlessly into the broader profiling and ML ecosystem.
- Design robust, flexible APIs and interfaces that enable seamless integration of profiling tools with various frameworks and custom code.
- A history of breaking down ill-defined problems in complex technical domains, designing effective solutions, and leading teams to implement them.
Other
- Strong interpersonal, verbal, and written communication, demonstrating the ability to build cross-organizational partnerships and lead technical teams through complex challenges.
- Translating Customer Needs: Skilled at turning customer requests into actionable use cases and requirements.
- Strong understanding of system security principles.