
Get Jobs Tailored to Your Resume

Filtr uses AI to scan 1,000+ jobs and find postings that perfectly match your resume


Principal/Senior GPU Software Performance Engineer — Training at Scale

AMD

Salary not specified
Oct 24, 2025
San Jose, CA, US

Make training large models across multi-GPU clusters materially faster and cheaper by leading kernel-level performance engineering.

Requirements

  • Hands-on GPU kernel work and shipped optimizations in production training or HPC
  • Expert in modern C++ (C++17+) and at least one GPU programming model (CUDA, HIP, or SYCL/oneAPI) or a GPU kernel DSL (e.g., Triton); comfortable with templates, memory qualifiers, atomics, and warp/wave-level collectives (see the sketch after this list)
  • Deep understanding of GPU microarchitecture: SIMT execution, occupancy vs. register/scratchpad pressure, memory hierarchy (global/L2/shared or LDS), coalescing, bank conflicts, vectorization, and instruction-level parallelism
  • Proficiency with profiling & analysis: timelines and counters (e.g., Nsight Systems/Compute, rocprof/Omniperf, VTune/GPA or equivalents), ISA/disassembly inspection, and correlating metrics to code changes
  • Proven track record of reducing time-to-train or $/step via kernel and collective-comms optimizations on multi-GPU clusters
  • Strong Linux fundamentals (perf/eBPF, NUMA, PCIe/links), build systems (CMake/Bazel), Python, and containerized dev (Docker/Podman)
  • Experience with distributed training (PyTorch DDP/FSDP/ZeRO/DeepSpeed or JAX) and GPU collectives
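
For illustration, here is a minimal CUDA sketch of the warp-level collectives named in the kernel-programming requirement above (HIP mirrors this API with wave-level builtins). It sums an array by reducing within each warp via __shfl_down_sync, staging per-warp partials in shared memory, and combining across blocks with an atomic. All names are hypothetical; this is a sketch of the technique, not code from the role.

    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    __inline__ __device__ float warpReduceSum(float val) {
        // Each step halves the number of active lanes; after five steps
        // lane 0 holds the sum of all 32 lanes in the warp.
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffff, val, offset);
        return val;
    }

    __global__ void blockSum(const float* in, float* out, int n) {
        __shared__ float warpSums[32];      // one slot per warp in the block
        int tid  = blockIdx.x * blockDim.x + threadIdx.x;
        int lane = threadIdx.x % 32;
        int warp = threadIdx.x / 32;

        float v = (tid < n) ? in[tid] : 0.0f;
        v = warpReduceSum(v);               // intra-warp tree reduction
        if (lane == 0) warpSums[warp] = v;  // stage per-warp partials
        __syncthreads();

        // The first warp reduces the per-warp partial sums.
        if (warp == 0) {
            int numWarps = (blockDim.x + 31) / 32;
            v = (lane < numWarps) ? warpSums[lane] : 0.0f;
            v = warpReduceSum(v);
            if (lane == 0) atomicAdd(out, v);  // combine across blocks
        }
    }

    int main() {
        const int n = 1 << 20;
        std::vector<float> host(n, 1.0f);   // all ones, so the sum should be n
        float *in, *out;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&out, sizeof(float));
        cudaMemset(out, 0, sizeof(float));
        cudaMemcpy(in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        blockSum<<<(n + 255) / 256, 256>>>(in, out, n);

        float sum = 0.0f;
        cudaMemcpy(&sum, out, sizeof(float), cudaMemcpyDeviceToHost);
        printf("sum = %.0f (expected %d)\n", sum, n);
        cudaFree(in);
        cudaFree(out);
        return 0;
    }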

Responsibilities

  • Own kernel performance: Design, implement, and land high-impact HIP/C++ kernels (e.g., attention, layernorm, softmax, GEMM/epilogues, fused pointwise) that are wave-size portable and optimized for LDS, caches, and MFMA units.
  • Lead profiling & tuning: Build repeatable workflows with timelines, hardware counters, and roofline analysis; remove memory bottlenecks; tune launch geometry/occupancy; validate speedups with A/B harnesses (see the harness sketch after this list).
  • Drive fusion & algorithmic improvements: Identify profitable fusions, tiling strategies, vectorized I/O, shared-memory/scratchpad layouts, asynchronous pipelines, and warp/wave-level collectives—while maintaining numerical stability.
  • Influence frameworks & libraries: Upstream or extend performance-critical ops in PyTorch/JAX/XLA/Triton; evaluate and integrate vendor math libraries; guide compiler/codegen choices for target architectures.
  • Scale beyond one GPU: Optimize P2P and collective comms, overlap compute/comm, and improve data/pipeline/tensor parallelism throughput across nodes.
  • Benchmarking & SLOs: Define and own KPIs (throughput, time-to-train, $/step, energy/step); maintain dashboards, perf CI gates, and regression triage.
  • Technical leadership: Mentor senior engineers, set coding/perf standards, lead performance “war rooms,” and partner with silicon/vendor teams on microarchitecture-aware optimizations.
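
And a hypothetical sketch of the A/B validation harness mentioned in the profiling bullet above: it times a baseline and a candidate kernel with CUDA events, takes the median over repeated runs to suppress clock noise, and reports the speedup. The two copy kernels are stand-ins for a real baseline/optimized pair, not part of the posting.

    #include <algorithm>
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>

    // Stand-in "baseline" and "candidate": scalar copy vs. float4 copy.
    __global__ void copyScalar(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    __global__ void copyVec4(const float4* in, float4* out, int n4) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n4) out[i] = in[i];
    }

    // Median-of-N wall time for a launch closure, with one untimed warmup.
    template <typename F>
    float medianMs(F launch, int iters = 20) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        launch();                            // warmup, excluded from timing
        cudaDeviceSynchronize();
        std::vector<float> ms(iters);
        for (int i = 0; i < iters; ++i) {
            cudaEventRecord(start);
            launch();
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            cudaEventElapsedTime(&ms[i], start, stop);
        }
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        std::nth_element(ms.begin(), ms.begin() + iters / 2, ms.end());
        return ms[iters / 2];                // median is robust to outliers
    }

    int main() {
        const int n = 1 << 24;               // divisible by 4 for the float4 path
        float *in, *out;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));

        float a = medianMs([&] { copyScalar<<<(n + 255) / 256, 256>>>(in, out, n); });
        float b = medianMs([&] {
            copyVec4<<<(n / 4 + 255) / 256, 256>>>(
                reinterpret_cast<const float4*>(in),
                reinterpret_cast<float4*>(out), n / 4);
        });
        printf("scalar %.3f ms, float4 %.3f ms, speedup %.2fx\n", a, b, a / b);
        cudaFree(in);
        cudaFree(out);
        return 0;
    }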

Other

  • Partner with researchers, framework teams, and infrastructure teams.
  • Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent
  • Hybrid role based in San Jose, CA