The company needs someone to run a large GPU cluster in its Toronto datacenter. The work includes managing and optimizing HPC cluster operations, planning for future capacity, and evaluating new technologies.
Requirements
- Proficiency in Linux systems administration (Ubuntu/Debian)
- Experience with Kubernetes and container orchestration
- Experience deploying and maintaining Ceph clusters larger than 1 PB
- Knowledge of security best practices in multi-tenant environments
- Understanding of L2/L3 networking fundamentals
- Skilled in Python and Bash scripting
- Experience with infrastructure-as-code tools (Ansible/Terraform)
Responsibilities
- Manage and optimize HPC cluster operations
- Deploy and maintain infrastructure-as-code solutions
- Support ML/research teams with cluster usage optimization
- Operate, troubleshoot, and optimize Ceph storage clusters
- Develop automation and tooling
Other
- 5+ years of experience in SRE or HPC operations
- Natural problem-solver with a passion for continuous learning
- Ability to work closely with engineering and science teams