HPC Engineer, AI/ML Infrastructure

Boson AI

Salary not specified

Nov 5, 2025

Santa Clara, CA, United States of America

The company is hiring an engineer to manage and optimize its High Performance Computing (HPC) infrastructure, specifically a large GPU cluster, in order to support its ML/research teams and keep operations running smoothly as the cluster scales.

Requirements

5+ years of experience in HPC operations.
Proficiency in Linux systems administration (Ubuntu/Debian).
Experience with Kubernetes and container orchestration.
Knowledge of security best practices in multi-tenant environments.
Understanding of L2/L3 networking fundamentals.
Skilled in Python and Bash scripting.
Experience with infrastructure-as-code tools (Ansible/Terraform).

Responsibilities

Manage and optimize HPC cluster operations.
Deploy and maintain infrastructure-as-code solutions.
Support ML/research teams with cluster usage optimization.
Operate, troubleshoot, and optimize Ceph storage clusters.
Develop automation and tooling.

Other

If you're a natural problem-solver with a passion for continuous learning, we'd love to hear from you.