OpenAI trains its flagship AI models on custom-built supercomputers, and training them efficiently depends on optimizing the collective communication stack that ties those systems together.
Requirements
- Experience with low-level systems engineering, such as CPU/GPU kernels, RDMA, high-performance networking, or HPC.
- Experience with NCCL (NVIDIA Collective Communications Library) or other collective communication work (an illustrative sketch of this kind of code follows this list).
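To give a sense of the kind of collective communication code referenced above, here is a minimal, illustrative sketch: an in-place all-reduce (sum) across all GPUs visible to a single process, using NCCL's public C API. This is not OpenAI code; the buffer size, GPU count cap, and omission of error handling and multi-node bootstrap are simplifying assumptions.

```c
/*
 * Minimal sketch (not OpenAI code): in-place all-reduce (sum) across all
 * GPUs visible to one process, using NCCL. Error checking, multi-node
 * bootstrap, and meaningful buffer contents are intentionally omitted.
 */
#include <nccl.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define MAX_LOCAL_GPUS 8   /* assumption: at most 8 GPUs handled by this process */

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > MAX_LOCAL_GPUS) ndev = MAX_LOCAL_GPUS;

    ncclComm_t  comms[MAX_LOCAL_GPUS];
    cudaStream_t streams[MAX_LOCAL_GPUS];
    float      *buf[MAX_LOCAL_GPUS];
    const size_t count = 1 << 20;          /* 1M floats per GPU */

    /* One communicator per local GPU, all owned by this single process. */
    ncclCommInitAll(comms, ndev, NULL);

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
        cudaMalloc((void **)&buf[i], count * sizeof(float));
        cudaMemset(buf[i], 0, count * sizeof(float));
    }

    /* Group the per-GPU calls so NCCL launches the collective as one operation. */
    ncclGroupStart();
    for (int i = 0; i < ndev; ++i) {
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    }
    ncclGroupEnd();

    for (int i = 0; i < ndev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(buf[i]);
        cudaStreamDestroy(streams[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("all-reduce across %d GPU(s) complete\n", ndev);
    return 0;
}
```

In a real training fleet the same collective spans many nodes (communicators bootstrapped across processes with ncclCommInitRank), and the work described in this posting sits underneath calls like these: making them fast and reliable over RDMA-class fabrics rather than simply invoking them.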
Responsibilities
- Manage a highly senior team that develops the communication and systems software behind OpenAI’s largest training workloads.
- Collaborate closely with ML research and infrastructure teams to ensure system priorities align with evolving model needs.
- Grow and support engineers on your team through mentorship, project alignment, and performance development.
- Prioritize across projects and maintain visibility into incoming research demands to keep critical training infrastructure ahead of bottlenecks.
Other
- We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.
- You are an experienced leader.
- You are excited to manage a deeply technical team and guide systems work that directly enables AI research at massive scale.
- You enjoy working closely with other high-context teams across research and infrastructure to solve complex, cross-cutting problems.