Senior System Engineer

$140,000 - $158,000

Sep 20, 2025

San Jose, CA, USA

Supermicro is looking for a Senior System Engineer to resolve business-critical application and service issues and to maintain HPC/AI infrastructure performance.

Experience with leading AI/ML frameworks such as PyTorch, TensorFlow, and ONNX
Experience with DevOps or in cloud environments, including but not limited to Docker/Containers and Kubernetes
Hands-on experience with workload managers/schedulers (Slurm) for racks and clusters
Familiarity with MLPerf Training/Inference benchmarks, LLMs, HPL-AI, or RCCL/NCCL
Programming experience with Windows and Linux shell scripting
Familiarity with Intel/AMD/NVIDIA development toolkits such as CUDA, oneAPI, and ROCm is a plus
Experience with server/network hardware debugging and troubleshooting is a plus

Execute comprehensive system-level rack tests on the latest NVIDIA and AMD GPUs and on ARM-based, Intel Xeon, and AMD EPYC processors, covering functionality, compatibility, performance, stress, and reliability testing, using proprietary in-house tools
Build expertise in HPC/AI applications and benchmarks, deliver training sessions to customers and partners, resolve complex customer support issues, and build robust processes and procedures for HPC/AI solutions
Conduct proof of concept design and testing, providing optimized benchmarks for HPC/AI applications in a timely manner
Deliver on-site deployment services, ensuring customer acceptance verification and providing post-deployment Level 1 and Level 2 support
Identify and document hardware and software quality issues and collaborate with Product Management and other Engineering teams to integrate customer feedback into future product enhancements
Proactively engage in HPC roadmap development, planning software and hardware upgrades to sustain exceptional HPC infrastructure performance
Document and analyze test plans, reports, and logs, and actively contribute to the development of test utilities and automation scripts to streamline testing processes