TensorWave is focused on enabling and optimizing AI workloads, specifically migrating CUDA-based workloads to run efficiently on AMD hardware via ROCm, to drive the next generation of AI innovation on its versatile cloud platform.
Requirements
- Strong hands-on experience with CUDA, HIP, and ROCm.
- Proficiency in kernel development (e.g., CUDA, HIP, Composable Kernel, Triton).
- Deep knowledge of GPU performance profiling tools (Nsight, rocprof, perf, etc.).
- Understanding of distributed ML workloads (e.g., PyTorch Distributed, MPI, RCCL).
- Strong programming skills in Python, C++, and GPU kernel languages.
- Contributions to ROCm-enabled open source ML frameworks (PyTorch, Megatron, vLLM, SGLang, etc.).
- Familiarity with compiler technology (LLVM, MLIR, XLA).
Responsibilities
- Partner with customers, internal engineering, and third-party developers to migrate CUDA workloads to ROCm.
- Profile, debug, and optimize GPU kernels for performance, scalability, and efficiency.
- Contribute to ROCm enablement across open source ML frameworks and libraries.
- Leverage tools such as Composable Kernel, HIP, PyTorch/XLA, and RCCL to enable and tune distributed training workloads.
- Provide technical guidance on best practices for GPU portability, including kernel-level optimizations, mixed precision, and memory hierarchy usage.
- Act as a technical liaison, translating customer requirements into actionable engineering work.
- Create internal documentation, playbooks, and training material to scale knowledge across teams.
Other
- Proven ability to work in customer-facing technical roles, including solution design and workload migration.
- Represent TensorWave in the broader ROCm ecosystem through contributions, collaboration, and customer advocacy.
Success Criteria
- Customers successfully migrate their CUDA workloads to ROCm and optimize them, with measurable performance gains.
- Strong collaboration between internal engineering and external developers leads to faster enablement of ROCm workloads.
- Best practices, playbooks, and tooling are well documented and continuously improved.