Cohere is looking to build and operate world-class infrastructure and tools to train, evaluate, and serve its foundational models. The aim is to scale intelligence to serve humanity by supporting AI researchers and accelerating the development of industry-leading AI models.
Requirements
- Have deep experience running Kubernetes clusters at scale and/or scaling and troubleshooting Cloud Native infrastructure, including Infrastructure as Code
- Have strong programming skills in Go or Python
- Prefer contributing to Open Source solutions rather than building solutions from the ground up
- Have worked with ML training infrastructure and GPU workloads, and are familiar with RDMA networking
- Have the expertise to support and troubleshoot low-level Linux systems
- Have experience collaborating with research teams or machine learning engineers
Responsibilities
- Build and operate Kubernetes compute superclusters across multiple clouds
- Partner with cloud providers to optimize infrastructure costs, performance, and reliability for AI workloads
- Work closely with research teams to understand their infrastructure needs and identify ways to improve stability, performance, and efficiency of novel model training techniques
- Design and build resilient, scalable systems for training AI models, focusing on intuitive user interfaces that empower researchers to troubleshoot and resolve problems on their own
- Encourage software best practices across our company and participate in team processes such as knowledge sharing, reviews, and on-call
Other
- All of our infrastructure roles require participating in a 24x7 on-call rotation, where you are compensated for your on-call schedule.
- Are self-directed and adaptable, and excel at identifying and solving key problems
- Draw motivation from building systems that help others be more productive
- See mentorship, knowledge transfer, and review as essential prerequisites for a healthy team
- Have excellent communication skills and thrive in fast-paced environments