OpenAI's Training Runtime team is looking to improve the training throughput for their internal training framework, enabling researchers to experiment with new ideas and develop the next generation of models more efficiently. This involves optimizing the performance of large-scale distributed machine learning training.
Requirements
- Have run small-scale ML experiments
- Have strong software engineering skills and are proficient in Python
- Are interested in topics such as:
  - optimizing performance
  - understanding distributed systems
  - writing bug-free machine learning code
  - building deep knowledge of supercomputer performance
  - designing, implementing, and optimizing state-of-the-art AI models
Responsibilities
- Apply the latest techniques in our internal training framework to achieve high hardware efficiency for our training runs
- Profile and optimize our training framework (see the profiling sketch after this list)
- Work with researchers to enable them to develop the next generation of models
- Design the core distributed machine-learning training runtime
- Build a unified, modular runtime that meets researchers where they are and moves with them up the scaling curve
- Implement high-performance, asynchronous, zero-copy, tensor- and optimizer-state-aware data movement
- Build performant, high-uptime, fault-tolerant training frameworks (training loop, state management, resilient checkpointing, deterministic orchestration, and observability); a checkpointing sketch follows this list
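To give a concrete flavor of the profiling work above, here is a minimal sketch of capturing a few training steps with PyTorch's built-in profiler. It is illustrative only, not a description of our internal tooling; `train_step` and `data_loader` are hypothetical placeholders.

```python
import torch
from torch.profiler import (
    ProfilerActivity,
    profile,
    schedule,
    tensorboard_trace_handler,
)

def profile_a_few_steps(train_step, data_loader, logdir="./profiler_logs"):
    # Capture 5 steps total: 1 ignored, 1 warmup, 3 recorded.
    # The resulting trace can be inspected in TensorBoard.
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=1, active=3),
        on_trace_ready=tensorboard_trace_handler(logdir),
    ) as prof:
        for _, batch in zip(range(5), data_loader):
            train_step(batch)  # hypothetical: one forward/backward/optimizer step
            prof.step()        # advance the profiler's wait/warmup/active schedule
```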
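Similarly, a minimal sketch of the resilient-checkpointing idea: snapshot training state to host memory so the training loop is not blocked on the file write. This is a simplification under assumed names (`model`, `optimizer`); a production runtime would also offload optimizer tensors and guard against overlapping writes.

```python
import threading
import torch

def checkpoint_async(model, optimizer, step, path):
    # Copy model tensors to CPU synchronously (cheap relative to disk I/O)
    # so training can resume while the file write happens in the background.
    snapshot = {
        "step": step,
        "model": {k: v.detach().to("cpu", copy=True)
                  for k, v in model.state_dict().items()},
        # Simplification: optimizer state may still hold device tensors;
        # a real implementation would offload these as well.
        "optimizer": optimizer.state_dict(),
    }
    writer = threading.Thread(target=torch.save, args=(snapshot, path), daemon=True)
    writer.start()
    return writer  # caller can join() before exiting to ensure durability
```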
Other
- This role is based in San Francisco, CA.
- We use a hybrid work model of three days in the office per week and offer relocation assistance to new employees.
- We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or any other legally protected characteristic.
- Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law.