Building distributed training infrastructure to power a frontier-scale superintelligence platform for breakthrough AI research at a cutting-edge company
Requirements
- Proven experience with distributed ML training frameworks
- Strong engineering background in Python and C++
- Understanding of large-scale model training techniques
- Experience in cloud or HPC environments
Responsibilities
- Creating the systems that make breakthrough AI research possible
- Working on large-scale training infrastructure for LLMs and multimodal models
- Building the backbone for models that generate new knowledge across multiple domains
- Designing distributed training systems
- Optimising performance
- Building scalable pipelines that enable complex experiments to run across thousands of GPUs
Other
- Onsite in San Francisco, CA or Boston, MA
- Full benefits
- Degree requirements not specified, but strong engineering background required