Mercor collaborates with the world’s leading AI research labs to build and train cutting-edge AI models. In this role, you will develop, optimize, and benchmark CUDA kernels for tensor and operator workloads to improve AI model performance.
Requirements
- Deep expertise in CUDA, GPU architecture, and memory optimization
- Proven record of quantifiable performance improvements across hardware generations
- Proficiency with mixed precision, Tensor Core usage, and low-level numerical stability
- Familiarity with PyTorch, TensorFlow, or Triton (preferred but not required)
Responsibilities
- Develop, optimize, and benchmark CUDA kernels for tensor and operator workloads
- Tune kernels for occupancy, memory coalescing, instruction-level parallelism, and efficient warp scheduling
- Profile and diagnose performance bottlenecks with tools such as Nsight Systems and Nsight Compute
- Report performance results, analyze speedups, and propose architectural improvements
- Integrate kernels with PyTorch and collaborate asynchronously with operator specialists
- Produce reproducible benchmarks and write comprehensive performance documentation
Other
- Strong communication and independent problem-solving skills
- Demonstrated contributions to open-source projects, research, or performance benchmarking
- Training support will be provided
- Hourly contract
- Remote