Cohere is hiring a senior engineer to build, maintain, and evolve the training framework that powers its frontier-scale language models, with the goal of increasing the models' capabilities and the value they deliver to customers.
Requirements
- Strong engineering experience in large-scale distributed training or HPC systems.
- Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.
- Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar).
- Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines.
- Experience working with containerized environments (Docker, Singularity/Apptainer).
- A track record of building tools that increase developer velocity for ML teams.
- Experience with training LLMs or other large transformer architectures.
Responsibilities
- Build and own the training framework responsible for large-scale LLM training.
- Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).
- Improve training throughput and stability on multi-node clusters (e.g., GB200/GB300, AMD, H200/H100).
- Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.
- Collaborate closely with infra teams to ensure Slurm setups, container environments, and hardware configurations support high-performance training.
- Investigate and resolve performance bottlenecks across the ML systems stack.
- Build robust systems that ensure reproducible, debuggable, large-scale runs.
Other
- Excellent judgment around trade-offs: performance vs complexity, research velocity vs maintainability.
- Strong collaboration skills — you’ll work closely with infra, research, and deployment teams.
- We value and celebrate diversity and strive to create an inclusive work environment for all.
- We welcome applicants from all backgrounds and are committed to providing equal opportunities.
- Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form.