Senior ML Systems Engineer, Frameworks & Tooling

Cohere

Salary not specified
Dec 1, 2025
New York, NY, US • San Francisco, CA, US

Cohere is hiring a senior engineer to build, maintain, and evolve the training framework that powers its frontier-scale language models, with the goal of increasing model capabilities and the value those models deliver to customers.

Requirements

  • Strong engineering experience in large-scale distributed training or HPC systems.
  • Deep familiarity with JAX internals, distributed training libraries, or custom kernels/fused ops.
  • Experience with multi-node cluster orchestration (Slurm, Ray, Kubernetes, or similar).
  • Comfort debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines.
  • Experience working with containerized environments (Docker, Singularity/Apptainer).
  • A track record of building tools that increase developer velocity for ML teams.
  • Experience with training LLMs or other large transformer architectures.

Responsibilities

  • Build and own the training framework responsible for large-scale LLM training.
  • Design distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO strategies, memory management, checkpointing).
  • Improve training throughput and stability on multi-node clusters (e.g., GB200/GB300, H200/H100, AMD).
  • Develop and maintain tooling for monitoring, logging, debugging, and developer ergonomics.
  • Collaborate closely with infra teams to ensure Slurm setups, container environments, and hardware configurations support high-performance training.
  • Investigate and resolve performance bottlenecks across the ML systems stack.
  • Build robust systems that ensure reproducible, debuggable, large-scale runs.

Other

  • Excellent judgment around trade-offs: performance vs. complexity, research velocity vs. maintainability.
  • Strong collaboration skills: you'll work closely with infra, research, and deployment teams.
  • We value and celebrate diversity and strive to create an inclusive work environment for all.
  • We welcome applicants from all backgrounds and are committed to providing equal opportunities.
  • If you require any accommodations during the recruitment process, please submit an Accommodations Request Form.