Cohere is looking to build and operate world-class infrastructure and tools to train, evaluate, and serve its foundation AI models, with the aim of scaling intelligence to serve humanity.
Requirements
- Deep expertise in ML/HPC infrastructure: Experience with GPU/TPU clusters, distributed training frameworks (JAX, PyTorch, TensorFlow), and high-performance computing (HPC) environments.
- Kubernetes at scale: Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads.
- Strong programming skills: Proficiency in Python (for ML tooling) and Go (for systems engineering), with a preference for open-source contributions over reinventing solutions.
- Low-level systems knowledge: Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads.
- Research collaboration experience: A track record of working closely with AI researchers or ML engineers to solve infrastructure challenges.
- Self-directed problem-solving: The ability to identify bottlenecks, propose solutions, and drive impact in a fast-paced environment.
Responsibilities
- Build and scale ML-optimized HPC infrastructure: Deploy and manage Kubernetes-based GPU/TPU superclusters across multiple clouds, ensuring high throughput and low-latency performance for AI workloads.
- Optimize for AI/ML training: Collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance, leveraging technologies like RDMA, NCCL, and high-speed interconnects.
- Troubleshoot and resolve complex issues: Proactively identify and resolve infrastructure bottlenecks, performance degradation, and system failures to ensure minimal disruption to AI/ML workflows.
- Enable researchers with self-service tools: Design intuitive interfaces and workflows that allow researchers to monitor, debug, and optimize their training jobs independently.
- Drive innovation in ML infrastructure: Work closely with AI researchers to understand emerging needs (e.g., JAX, PyTorch, distributed training) and translate them into robust, scalable infrastructure solutions.
- Champion best practices: Advocate for observability, automation, and infrastructure-as-code (IaC) across the organization, ensuring systems are maintainable and resilient.
- Mentorship and collaboration: Share expertise through code reviews, documentation, and cross-team collaboration, fostering a culture of knowledge transfer and engineering excellence.
Other
- All of our infrastructure roles require participation in a 24x7 on-call rotation, for which you are compensated.
- We value and celebrate diversity and strive to create an inclusive work environment for all.
- We welcome applicants from all backgrounds and are committed to providing equal opportunities.
- Should you require any accommodations during the recruitment process, please submit an Accommodations Request Form, and we will work together to meet your needs.