Make training large models across multi-GPU clusters materially faster and cheaper by leading kernel-level performance engineering.
Requirements
- Hands-on GPU kernel development, with optimizations shipped in production training or HPC workloads
- Expertise in modern C++ (C++17+) and at least one GPU programming model (CUDA, HIP, or SYCL/oneAPI) or a GPU kernel DSL (e.g., Triton); comfortable with templates, memory qualifiers, atomics, and warp/wave-level collectives
- Deep understanding of GPU microarchitecture: SIMT execution, occupancy vs. register/scratchpad pressure, memory hierarchy (global/L2/shared or LDS), coalescing, bank conflicts, vectorization, and instruction-level parallelism
- Proficiency with profiling & analysis: timelines and counters (e.g., Nsight Systems/Compute, rocprof/Omniperf, VTune/GPA or equivalents), ISA/disassembly inspection, and correlating metrics to code changes
- Proven track record of reducing time-to-train or $/step via kernel and collective-communication optimizations on multi-GPU clusters
- Strong Linux fundamentals (perf/eBPF, NUMA, PCIe and interconnect links), build systems (CMake/Bazel), Python, and containerized development (Docker/Podman)
- Experience with distributed training (PyTorch DDP/FSDP/ZeRO/DeepSpeed or JAX) and GPU collectives
Responsibilities
- Own kernel performance: Design, implement, and land high-impact HIP/C++ kernels (e.g., attention, layernorm, softmax, GEMM/epilogues, fused pointwise) that are wave-size portable and optimized for LDS, caches, and MFMA units.
- Lead profiling & tuning: Build repeatable workflows with timelines, hardware counters, and roofline analysis; remove memory bottlenecks; tune launch geometry/occupancy; validate speedups with A/B harnesses.
- Drive fusion & algorithmic improvements: Identify profitable fusions, tiling strategies, vectorized I/O, shared-memory/scratchpad layouts, asynchronous pipelines, and warp/wave-level collectives—while maintaining numerical stability.
- Influence frameworks & libraries: Upstream or extend performance-critical ops in PyTorch/JAX/XLA/Triton; evaluate and integrate vendor math libraries; guide compiler/codegen choices for target architectures.
- Scale beyond one GPU: Optimize P2P and collective comms, overlap compute/comm, and improve data/pipeline/tensor parallelism throughput across nodes.
- Benchmarking & SLOs: Define and own KPIs (throughput, time-to-train, $/step, energy/step); maintain dashboards, perf CI gates, and regression triage.
- Technical leadership: Mentor senior engineers, set coding/perf standards, lead performance “war rooms,” and partner with silicon/vendor teams on microarchitecture-aware optimizations.
Other
- Partner with researchers, framework teams, and infrastructure teams
- Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent
- Hybrid; based in San Jose, CA