Lila Sciences is seeking an ML Engineer specializing in distributed and scalable training to design and maintain large-scale training systems, optimize performance for massive models, and integrate cutting-edge techniques that improve efficiency and throughput for its scientific superintelligence platform.
Requirements
- Proven experience with distributed ML training frameworks (Megatron-LM, TorchTitan, DeepSpeed, Ray); see the sketch after this list for the kind of fluency involved
- Strong software engineering skills in Python; C++ kernel contributions are a plus
- Understanding of large-scale model training techniques (e.g., data, tensor, and pipeline parallelism; mixed-precision training)
- Experience with cloud or HPC environments
- Prior work with scientific datasets or domain-specific modeling
- Contributions to open-source ML frameworks
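As a rough illustration of the framework fluency the first bullet describes, here is a minimal DeepSpeed training sketch. The model, config values, and synthetic batches are hypothetical placeholders, not details from this posting.

```python
import torch
import torch.nn as nn
import deepspeed

# Hypothetical toy model; the actual role involves much larger LLMs.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

# Illustrative config: ZeRO stage 2 optimizer sharding with bf16 precision.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# deepspeed.initialize wraps the model in a distributed engine that
# handles optimizer sharding, gradient accumulation, and precision.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for _ in range(10):  # placeholder for a real distributed dataloader
    batch = torch.randn(8, 1024).to(model_engine.device)
    loss = model_engine(batch).pow(2).mean()  # dummy loss for the sketch
    model_engine.backward(loss)  # engine manages scaling and accumulation
    model_engine.step()
```

In practice a script like this is started with the `deepspeed` launcher (e.g., `deepspeed --num_gpus=8 train.py`), which sets up the distributed process group across GPUs.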
Responsibilities
- Design and maintain large-scale training systems
- Optimize performance for massive models
- Integrate cutting-edge techniques to improve efficiency and throughput
- Build and operate Ray-based distributed training infrastructure for LLMs and multi-modal models (a minimal Ray Train sketch follows this list)
- Drive performance optimizations across large-scale training and optimization workflows, including SFT, MoE, and long-context scaling
- Orchestrate frontier and open-source LLMs alongside complex, compute-intensive tool use
- Build scalable pipelines for data preprocessing and experiment orchestration, including tools for efficient data loading, pipeline parallelism, and optimizer tuning
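For the Ray-based infrastructure bullet above, a minimal sketch of how distributed data-parallel training is typically expressed with Ray Train's TorchTrainer. The model, batch shapes, and scale settings are illustrative assumptions, not Lila's actual stack.

```python
import torch
import torch.nn as nn
import ray.train.torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Each Ray worker runs this loop; prepare_model wraps the model in
    # DistributedDataParallel and moves it to the worker's device.
    model = nn.Linear(1024, 1024)  # hypothetical stand-in for an LLM
    model = ray.train.torch.prepare_model(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=config["lr"])
    device = ray.train.torch.get_device()

    for _ in range(config["steps"]):
        x = torch.randn(8, 1024, device=device)  # placeholder batch
        loss = model(x).pow(2).mean()  # dummy loss for the sketch
        optimizer.zero_grad()
        loss.backward()  # DDP all-reduces gradients across workers
        optimizer.step()


# ScalingConfig controls how many workers Ray schedules and whether each
# gets a GPU; Ray handles process-group setup and worker placement.
trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-4, "steps": 10},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
)
result = trainer.fit()
```

The same trainer definition scales from a laptop to a multi-node cluster by changing `ScalingConfig`, which is one reason Ray is a common substrate for this kind of infrastructure.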
Other
- If this sounds like an environment you'd love to work in, even if you only have some of the experience listed above, we encourage you to apply.