Sr. Reliability Engineer

Super Micro Computer

$145,000 - $165,000
Sep 2, 2025
San Jose, CA, USA

Supermicro is looking for an engineer to deploy and scale its Linux-based AI cloud platforms and to ensure high availability, performance, and security across GPU-accelerated compute clusters, Kubernetes workloads, and the supporting storage and network infrastructure.

Requirements

  • Proficiency in Linux (Ubuntu, RHEL/CentOS), containers (Docker, Podman), and orchestration (Kubernetes).
  • Experience managing GPU compute clusters (NVIDIA/CUDA, AMD/ROCm).
  • Hands-on experience with observability tools (Prometheus, Grafana, Loki, ELK, etc.).
  • Strong scripting and coding skills (Bash, Python, or Go).
  • Exposure to secure multi-tenant environments and zero trust architectures.
  • Familiarity with network protocols, DNS, DHCP, BGP, RoCEv2, and InfiniBand or high-throughput Ethernet fabrics.
  • Understanding of AI/ML reference architectures and experience with workflows, MLflow, or Kubeflow.

Responsibilities

  • Cloud Infra Automation: Design and provision cloud infrastructure using Infrastructure as Code (Terraform, Ansible, or Helm) on bare metal or cloud platforms. Develop custom automation and tooling in Python or Go to extend deployment workflows and streamline operations.
  • Platform Reliability: Deploy, scale, maintain, and optimize uptime for AI cloud services including GPU clusters, Kubernetes (K8s), and storage systems (e.g., Ceph, BeeGFS, or Weka). Understand the tools required to benchmark and assure consistent application performance.
  • Monitoring & Alerting: Implement observability tools (e.g., Prometheus, Grafana, ELK, Loki, Fluentd) to monitor system health and alert on anomalies or performance degradation.
  • Capacity Planning: Analyze usage trends and forecast infrastructure needs to support AI workloads and large-scale model training/inference.
  • Incident Management: Lead root cause analysis and resolution for system outages or degraded performance. Define and maintain service level objectives (SLOs), indicators (SLIs), and agreements (SLAs) aligned with uptime and performance goals.
  • CI/CD Integration: Collaborate with DevOps and MLOps teams to ensure reliable delivery pipelines using GitLab CI/CD, ArgoCD, or similar tools.
  • Security & Compliance: Harden Linux systems, manage TLS certificates, and enforce secure access controls via Role-Based Access Control (RBAC), LDAP-integrated SSO, TLS, and network segmentation policies.

Other

  • Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience), plus 8 years of experience
  • Excellent collaboration and communication skills for cross-team, partner, and customer initiatives
  • Certifications: CKA, CKAD, Linux+, or related credentials
  • Ability to work in a fast-paced environment and adapt to changing priorities
  • Strong problem-solving skills and attention to detail