Testing AI/ML Interconnect Solutions
Requirements
- Strong understanding of Ethernet functionality, TCP/IP networking, virtualization technologies, RDMA, and the PCIe protocol (Gen3 and above).
- Good understanding of AI/ML clusters, deep learning models, and GPU microbenchmarks.
- Strong networking experience with protocol testing and validation.
- Experience with L2/L3 protocols, especially RoCE (RDMA over Converged Ethernet) and its use cases in AI/ML and HPC clusters.
- Experience with AMD/NVIDIA GPUs, collective communication libraries (RCCL/NCCL), and the ROCm/CUDA software stacks.
- Experience using automation scripts in Python, primarily for network and system-level programming.
- Experience with network test equipment (protocol/PCIe analyzers, protocol jammers, and load generators such as Ixia, IxChariot, and Medusa tools) is a plus.
Responsibilities
- Create and review test scenarios, test cases, and test automation.
- Review design and functional specifications created by the development team to understand product functionality.
- Execute test activities and work closely with a multi-site team of developers and testers.
- Review user documentation to ensure it clearly describes product functionality.
- Prioritize and manage multiple parallel tasks, projects, and releases.
Other
- Highly focused and motivated engineer
- Strong analytical, problem-solving, and debugging skills.
- Excellent communication skills; a critical thinker and self-starter.
- Possess a strong “break the feature” mentality
- Possess a strong engineering mindset to develop thorough test cases