The company is looking for someone to manage and optimize its high-performance computing (HPC) infrastructure, specifically a large GPU cluster, to support its ML and research teams and keep operations running smoothly as they scale.
Requirements
- 5+ years of experience in HPC operations.
- Proficiency in Linux systems administration (Ubuntu/Debian).
- Experience with Kubernetes and container orchestration.
- Knowledge of security best practices in multi-tenant environments.
- Understanding of L2/L3 networking fundamentals.
- Proficiency in Python and Bash scripting.
- Experience with infrastructure-as-code tools (Ansible/Terraform).
Responsibilities
- Manage and optimize HPC cluster operations.
- Deploy and maintain infrastructure-as-code solutions.
- Support ML/research teams with cluster usage optimization.
- Operate, troubleshoot, and optimize Ceph storage clusters.
- Develop automation and tooling.
Other
- If you're a natural problem-solver with a passion for continuous learning, we'd love to hear from you.