Cohere is looking to scale intelligence to serve humanity by training and deploying frontier models for developers and enterprises building AI systems. This role aims to improve the overall quality of the post-training codebase.
Requirements
- Extremely strong software engineering skills
- Value test-driven development and clean code, and strive to reduce technical debt at all levels
- Proficiency in Python and related ML frameworks such as JAX, PyTorch and/or XLA/MLIR
- Experience using and debugging large-scale distributed training strategies (memory/speed profiling)
- Experience with distributed training infrastructure (Kubernetes) and associated frameworks (Ray)
- Hands-on experience with the post-training phase of model training, with a strong emphasis on scalability and performance
- Experience with academic research in ML, LLMs, and RL
Responsibilities
- Design and write high-performing and scalable software for training models
- Develop new tools to support and accelerate research and LLM training
- Coordinate with other engineering teams (Infrastructure, Efficiency, Serving) and the scientific teams (Agent, Multimodal, Multilingual, etc.) to create a strong and integrated post-training ecosystem
- Craft and implement techniques to improve performance and speed up our training cycles across the SFT, offline preference, and RL regimes
- Research, implement, and experiment with ideas on our cluster and data infrastructure
- Collaborate, Collaborate, and Collaborate with other scientists, engineers, and teams!
Other
- Have a deep passion for quality work
- Enjoy tuning and optimising LLMs
- Comfortable working with people with different levels of software engineering skills, from beginner to more advanced
- Comfortable diving into complex ML codebases to identify and resolve issues, ensuring the smooth operation of our systems
- Thrive in a fast-paced, technically challenging environment, where you can contribute your innovative ideas and solutions