Scale is looking to build and optimize its internal distributed framework for large language model (LLM) training and inference.
Requirements
Experience with multi-node LLM training and inference
Experience with developing large-scale distributed ML systems
Strong software engineering skills; proficient with frameworks and tools such as CUDA, PyTorch, Transformers, and FlashAttention
Demonstrated expertise in post-training methods and/or next-generation use cases for large language models, such as instruction tuning, RLHF, tool use, reasoning, agents, and multimodality
Responsibilities
Build, profile, and optimize our training and inference framework
Collaborate with ML teams to accelerate their research and development, enabling them to build the next generation of models and data-curation pipelines
Research and integrate state-of-the-art techniques to optimize our ML systems
Other
Strong enthusiasm for systems optimization
Strong written and verbal communication skills and the ability to operate in a cross-functional team environment
Comprehensive health, dental and vision coverage, retirement benefits, a learning and development stipend, and generous PTO