Supermicro is looking for an engineer to solve business-critical application and service issues, including resolving escalated service issues, coaching other engineers, and engineering and implementing complex projects.
Requirements
- Experience with leading AI/ML frameworks such as PyTorch, TensorFlow, ONNX, etc.
- Experience with DevOps or in cloud environments, including but not limited to Docker/Containers and Kubernetes
- Hands-on experience with workload managers/schedulers (Slurm) for racks and clusters
- Familiarity with MLPerf Training/Inference benchmarks, LLMs, HPL-AI, or RCCL/NCCL
- Programming experience with Windows and Linux shell scripting
- Familiarity with Intel/AMD/NVIDIA development toolkits such as CUDA, oneAPI, and ROCm is a plus
- Experience with server/network hardware debugging and troubleshooting is a plus
Responsibilities
- Execute comprehensive system-level rack tests on the latest NVIDIA and AMD GPUs and on ARM-based, Intel Xeon, and AMD EPYC processors, covering functionality, compatibility, performance, stress, and reliability testing with proprietary in-house tools.
- Establish expertise in HPC/AI applications and benchmarks, and deliver training sessions to customers and partners. Address complex customer support issues with innovative problem-solving, and build robust processes and procedures for HPC/AI solutions.
- Conduct proof of concept design and testing, providing optimized benchmarks for HPC/AI applications in a timely manner. Fine-tune BIOS settings, optimize OS/network configurations, and develop diverse simulation configurations to enhance efficiency across various workloads.
- Deliver on-site deployment services, ensuring customer acceptance verification and providing post-deployment Level 1 and Level 2 support. Create and maintain technical documentation, including technical notes, blogs, and diagrams, to facilitate knowledge sharing.
- Identify and document hardware and software quality issues and collaborate with Product Management and other Engineering teams to integrate customer feedback into future product enhancements.
- Proactively engage in HPC roadmap development, planning software and hardware upgrades to sustain exceptional HPC infrastructure performance.
- Document and analyze test plans, reports, and logs, and actively contribute to the development of test utilities and automation scripts to streamline testing processes.
Other
- BS/MS in Electrical Engineering, Computer Engineering or Computer Science
- 1+ years of work-related experience in Deep Learning and Machine Learning
- Strong teamwork and communication skills
- CCNA, OpenStack, OpenShift, Azure or AWS is a plus
- Travel may be required for on-site deployment services