Sr. System Engineer

Super Micro Computer

$137,000 - $156,000
Sep 5, 2025
San Jose, CA, USA

Supermicro is looking to roll out and maintain business-critical applications and services, resolve escalated service issues, and engineer and implement complex projects.

Requirements

  • 8+ years of work-related experience in Deep Learning and Machine Learning
  • 8+ years of Linux/networking debugging/testing or relevant experience preferred
  • Experience with leading AI/ML frameworks such as PyTorch, TensorFlow, ONNX, etc.
  • Experience with DevOps or in cloud environments, including but not limited to Docker/Containers and Kubernetes
  • Hands-on experience with workload managers/schedulers (Slurm) for racks/clusters
  • Familiarity with MLPerf Training/Inference benchmarks, LLMs, HPL-AI, or RCCL/NCCL
  • Programming experience with Windows and Linux shell scripting

Responsibilities

  • Execute comprehensive system-level rack tests on the latest NVIDIA and AMD GPUs and on Arm-based, Intel Xeon, and AMD EPYC processors, encompassing functionality, compatibility, performance, stress, and reliability testing, leveraging proprietary in-house tools
  • Establish expertise in HPC/AI applications and benchmarks, delivering impactful training sessions to customers and partners, while addressing complex customer support issues, demonstrating innovative problem-solving skills and building robust processes and procedures for HPC/AI solutions
  • Conduct proof of concept design and testing, providing optimized benchmarks for HPC/AI applications in a timely manner. Fine-tune BIOS settings, optimize OS/network configurations, and develop diverse simulation configurations to enhance efficiency across various workloads
  • Deliver on-site deployment services, ensuring customer acceptance verification and providing post-level 1&2 support. Create and maintain technical documentation, including technical notes, blogs, and diagrams, to facilitate knowledge dissemination
  • Identify and document hardware and software quality issues and collaborate with Product Management and other Engineering teams to integrate customer feedback into future product enhancements
  • Proactively engage in HPC roadmap development, planning software and hardware upgrades to sustain exceptional HPC infrastructure performance
  • Document and analyze test plans, reports, and logs, and actively contribute to the development of test utilities and automation scripts to streamline testing processes

Other

  • Able to work independently, with the leadership to drive technical development and excellent communication skills
  • Strong sense of teamwork; a good team player
  • Requires regular in-office attendance.
  • In-office collaboration and participation in team meetings, training sessions, and other on-site activities are essential aspects of this role.
  • Candidates should consider the commuting distance and be prepared to fulfill their responsibilities in the designated office location.