Site Reliability Engineer, AI/ML Infrastructure

Boson AI

Salary not specified

Dec 15, 2025

Santa Clara, CA, US

The company needs someone to run a large GPU cluster in its Toronto datacenter. The work includes managing and optimizing HPC cluster operations, planning for future capacity, and evaluating new technologies.

Requirements

Proficiency in Linux systems administration (Ubuntu/Debian)
Experience with Kubernetes and container orchestration
Experience deploying and maintaining Ceph clusters larger than 1 PB
Knowledge of security best practices in multi-tenant environments
Understanding of L2/L3 networking fundamentals
Skilled in Python and Bash scripting
Experience with infrastructure-as-code tools (Ansible/Terraform)

Responsibilities

Manage and optimize HPC cluster operations
Deploy and maintain infrastructure-as-code solutions
Support ML/research teams with cluster usage optimization
Operate, troubleshoot and optimize Ceph storage clusters
Develop automation and tooling

Other

5+ years of experience in SRE or HPC operations
Natural problem-solver with a passion for continuous learning
Ability to work closely with engineering and science teams