Make training large models across multi-GPU clusters materially faster and cheaper by leading kernel-level performance engineering.
Requirements
- Hands-on GPU kernel development, with optimizations shipped in production training or HPC workloads
- Expertise in modern C++ (C++17+) and at least one GPU programming model (CUDA, HIP, or SYCL/oneAPI) or a GPU kernel DSL (e.g., Triton); comfortable with templates, memory qualifiers, atomics, and warp/wave-level collectives
- Deep understanding of GPU microarchitecture: SIMT execution, occupancy vs. register/scratchpad pressure, memory hierarchy (global/L2/shared or LDS), coalescing, bank conflicts, vectorization, and instruction-level parallelism
- Proficiency with profiling & analysis: timelines and counters (e.g., Nsight Systems/Compute, rocprof/Omniperf, VTune/GPA or equivalents), ISA/disassembly inspection, and correlating metrics to code changes
- Proven track record of reducing time-to-train or $/step via kernel and collective-communication optimizations on multi-GPU clusters
- Strong Linux fundamentals (perf/eBPF, NUMA, PCIe and interconnect links), build systems (CMake/Bazel), Python, and containerized development (Docker/Podman)
- Experience with distributed training (PyTorch DDP/FSDP/ZeRO/DeepSpeed or JAX) and GPU collectives
Responsibilities
- Own kernel performance: Design, implement, and land high-impact HIP/C++ kernels (e.g., attention, layernorm, softmax, GEMM/epilogues, fused pointwise) that are wave-size portable and optimized for LDS, caches, and MFMA units.
- Lead profiling & tuning: Build repeatable workflows with timelines, hardware counters, and roofline analysis; remove memory bottlenecks; tune launch geometry/occupancy; validate speedups with A/B harnesses.
- Drive fusion & algorithmic improvements: Identify profitable fusions, tiling strategies, vectorized I/O, shared-memory/scratchpad layouts, asynchronous pipelines, and warp/wave-level collectives—while maintaining numerical stability.
- Influence frameworks & libraries: Upstream or extend performance-critical ops in PyTorch/JAX/XLA/Triton; evaluate and integrate vendor math libraries; guide compiler/codegen choices for target architectures.
- Scale beyond one GPU: Optimize P2P and collective comms, overlap compute/comm, and improve data/pipeline/tensor parallelism throughput across nodes.
- Benchmarking & SLOs: Define and own KPIs (throughput, time-to-train, $/step, energy/step); maintain dashboards, perf CI gates, and regression triage.
- Technical leadership: Mentor senior engineers, set coding/perf standards, lead performance “war rooms,” and partner with silicon/vendor teams on microarchitecture-aware optimizations.
Other
- Partner with researchers, framework teams, and infrastructure teams
- Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent
- Hybrid; based in San Jose, CA