NVIDIA is looking to develop and execute test plans for its HGX/DGX/MGX platforms across servers, OS, FW, and the CUDA SW stack, and to improve the reliability and validation of its products.
Requirements
- Proven experience in OS- and server-level automation, CI/CD processes, and DevOps using Python, shell, Ansible, Jenkins, C/C++, Java, and JavaScript
- Strong server and Linux (Ubuntu, Red Hat, CentOS, SUSE, Fedora, etc.) troubleshooting and debugging experience in bare-metal and KVM/VMware/Hyper-V environments.
- Good knowledge of and hands-on experience in model testing, AI tools/frameworks (TensorFlow, PyTorch, Cursor, etc.), and NLP and LLM benchmarking
- Experience using AI development tools for test plan creation, test case development, and test case automation
- Strong experience in FW, BMC/OpenBMC, network protocols, internal/external enterprise storage devices, PCIe buses and devices, IO sub-devices, CPU and memory, ACPI, the UEFI spec, and Redfish
- Proven experience with GitHub/GitLab/Gerrit, PXE, SLURM, and Stack/Kubernetes/Docker
- Experience working with NVIDIA GPU hardware
Responsibilities
- Develop and execute NVIDIA HGX/DGX/MGX platform test plans for servers, OS, FW, and the CUDA SW stack, based on design docs.
- Install and test various system OSes, server firmware, and SW stacks.
- Drive root cause analysis of reliability and validation test failures to identify the root cause(s) and achieve mitigation.
- Build, develop, and debug server- and OS-level automation frameworks (front end and back end) and tests.
- Review partner and supplier test results and prescribe additional reliability testing on components, servers, and packaging as needed.
- Work in an agile software development team with very high production quality standards.
- Manage the bug lifecycle and collaborate across groups to drive solutions.
Other
- Bachelor’s Degree (or equivalent experience) in a STEM (Science, Technology, Engineering, Math or Physics) field
- 5+ years of proven experience, or a master’s degree.
- Outstanding interpersonal skills and a strong sense of engagement and continuous process improvement.
- Dedicated, forward-thinking, and hard-working
- Ability to work in a diverse work environment