Strengthen the performance and scalability of our distributed training infrastructure and streamline the development and execution of large-scale training runs.
Requirements
- Experience with large-scale ML training pipelines and distributed training frameworks
- Strong software engineering skills in Python
- Passion for diving deep into systems implementations and understanding fundamentals to improve their performance and maintainability
- Experience improving resource efficiency across distributed computing environments through profiling, benchmarking, and system-level optimizations
Responsibilities
- Collaborate with researchers to help them develop systems-efficient models and architectures
- Apply the latest techniques to our internal training runs to achieve high hardware efficiency
- Create tooling to help researchers distribute their training jobs more effectively
- Profile and optimize our training runs
Other
- This position is a great fit for someone who enjoys working at the intersection of distributed systems and machine learning, values high-performance code, and has an interest in supporting innovative machine learning efforts.