Cohere is looking to build and operate world-class infrastructure and tools to train, evaluate, and serve its foundation AI models, with the aim of scaling intelligence to serve humanity.
Requirements
- Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments.
- Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads.
- Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions.
- Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads.
- Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges.
- Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment.
Responsibilities
- Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads.
- Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects.
- Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows.
- Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently.
- Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions.
- Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient.
- Mentorship and collaboration: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence.
Other
- All of our infrastructure roles require participation in a 24x7 on-call rotation, for which you are compensated.
- We value and celebrate diversity and strive to create an inclusive work environment for all.
- We welcome applicants from all backgrounds and are committed to providing equal opportunities.
- Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs.