Site Reliability Engineer, AI/ML Infrastructure

Boson Ai

Salary not specified
Nov 14, 2025
Santa Clara, CA, United States of America

The company is looking for a Senior Site Reliability Engineer to manage and optimize the GPU cluster infrastructure in its Toronto datacenter, keeping operations smooth for ML/research teams and planning for future scaling.

Requirements

  • Proficiency in Linux systems administration (Ubuntu/Debian)
  • Experience with Kubernetes and container orchestration
  • Experience deploying and maintaining Ceph clusters larger than 1 PB
  • Knowledge of security best practices in multi-tenant environments
  • Understanding of L2/L3 networking fundamentals
  • Proficiency in Python and Bash scripting
  • Experience with infrastructure-as-code tools (Ansible/Terraform)

Responsibilities

  • Manage and optimize HPC cluster operations
  • Deploy and maintain infrastructure-as-code solutions
  • Support ML/research teams with cluster usage optimization
  • Operate, troubleshoot, and optimize Ceph storage clusters
  • Develop automation and tooling

Other

  • 5+ years of experience in SRE or HPC operations.
  • If you're a natural problem-solver with a passion for continuous learning, we'd love to hear from you.