TensorWave is focused on enabling and optimizing AI workloads, specifically migrating CUDA-based workloads to run efficiently on AMD hardware via ROCm, to drive the next generation of AI innovation on its versatile cloud platform.
Requirements
- Strong hands-on experience with CUDA, HIP, and ROCm.
- Proficiency in kernel development (e.g., CUDA, HIP, Composable Kernel, Triton).
- Deep knowledge of GPU performance profiling tools (Nsight, rocprof, perf, etc.).
- Understanding of distributed ML workloads (e.g., PyTorch Distributed, MPI, RCCL).
- Strong programming skills in Python, C++, and GPU kernel languages.
- Contributions to ROCm-enabled open source ML frameworks (PyTorch, Megatron, vLLM, SGLang, etc.).
- Familiarity with compiler technology (LLVM, MLIR, XLA).
Responsibilities
- Partner with customers, internal engineering, and third-party developers to migrate CUDA workloads to ROCm.
- Profile, debug, and optimize GPU kernels for performance, scalability, and efficiency.
- Contribute to ROCm enablement across open source ML frameworks and libraries.
- Leverage tools such as Composable Kernel, HIP, PyTorch/XLA, and RCCL to enable and tune distributed training workloads.
- Provide technical guidance on best practices for GPU portability, including kernel-level optimizations, mixed precision, and memory hierarchy usage.
- Act as a technical liaison, translating customer requirements into actionable engineering work.
- Create internal documentation, playbooks, and training material to scale knowledge across teams.
Other
- Proven ability to work in customer-facing technical roles, including solution design and workload migration.
- Represent TensorWave in the broader ROCm ecosystem through contributions, collaboration, and customer advocacy.
Success Criteria
- Customers successfully migrate their CUDA workloads to ROCm and optimize them, with measurable performance gains.
- Strong collaboration between internal engineering and external developers leads to faster enablement of ROCm workloads.
- Best practices, playbooks, and tooling are well documented and continuously improved.