Site Reliability Engineer, AI/ML Infrastructure

Boson Ai

Salary not specified
Dec 15, 2025
Santa Clara, CA, US

Boson Ai is hiring an engineer to run a large GPU cluster in its Toronto datacenter. The role covers managing and optimizing HPC cluster operations, planning future capacity, and evaluating new technologies.

Requirements

  • Proficiency in Linux systems administration (Ubuntu/Debian)
  • Experience with Kubernetes and container orchestration
  • Experience deploying and maintaining Ceph clusters larger than 1 PB
  • Knowledge of security best practices in multi-tenant environments
  • Understanding of L2/L3 networking fundamentals
  • Proficiency in Python and Bash scripting
  • Experience with infrastructure-as-code tools (Ansible/Terraform)

Responsibilities

  • Manage and optimize HPC cluster operations
  • Deploy and maintain infrastructure-as-code solutions
  • Support ML/research teams with cluster usage optimization
  • Operate, troubleshoot, and optimize Ceph storage clusters
  • Develop automation and tooling

Other

  • 5+ years of experience in SRE or HPC operations
  • Natural problem-solver with a passion for continuous learning
  • Ability to work closely with engineering and science teams