Design device drivers and systems software for emerging ML hardware in the high-performance communication space, unlocking greater AI capability while dramatically improving efficiency.
Requirements
Deep experience with low-level systems programming.
Knowledge of OS internals, especially PCIe/IO subsystems and memory management.
Prior experience in accelerator programming (e.g. CUDA, JAX/Pallas, ROCm).
Prior experience with collective communication libraries (e.g. NCCL).
Experience with GPUDirect and RDMA is a strong plus.
Responsibilities
Develop low-latency, high-throughput data exchange systems between GPUs.
Develop high-performance data movement kernels.
Define and expose data movement interfaces to high-level ML frameworks (e.g. PyTorch).
Develop Linux device drivers for custom hardware.
Other
This role is onsite at one of our offices in Santa Clara, CA or Boston, MA.
Hacker mentality.
Relocation assistance and visa sponsorship.
A collaborative, continuous-learning work environment with smart, dedicated colleagues engaged in developing the next generation of architecture for high-performance computing.
We value thoughtful disagreement, fast learning, and intellectual fearlessness.