OpenAI's Compute Runtime team needs to build low-level framework components for ML training systems to maximize researcher and hardware productivity, accelerating progress towards AGI.
Requirements
- Have worked on large distributed systems
- Love figuring out how systems work and continuously come up with ideas for how to make them faster while minimizing complexity and maintenance burden
- Have strong software engineering skills and are proficient in Python and Rust or equivalent.
Responsibilities
- Work across our Python and Rust stack
- Profile and optimize and help design for scale our compute and data capabilities
- Work on deploying our training framework to our latest supercomputers rapidly responding to the changing shapes and needs of the ML systems.
Other
- This role is based in San Francisco, CA.
- We use a hybrid work model of 3 days in the office per week and offer relocation assistance to new employees.