AMD is looking for someone to ensure the quality, performance, and reliability of the multi-node GPU communication libraries that enable high-performance computing and machine learning workloads at exascale.
Requirements
- Strong background in software testing and quality assurance methodologies, including test automation, performance testing, and system-level validation.
- Proficiency in developing test scripts and automation frameworks using Python and Shell scripting.
- Experience with Linux/UNIX environments and cluster-computing concepts.
- Familiarity with network technologies relevant to HPC, such as RoCE (RDMA over Converged Ethernet), Libfabric, and InfiniBand.
- In-depth knowledge of best practices in software quality assurance, including testing types, regression analysis, defect tracking (e.g., JIRA), and version control (e.g., Git).
- Experience with communication libraries such as MPI, RCCL, or SHMEM.
- Understanding of the software development lifecycle (SDLC) and experience working within Agile/Scrum methodologies.
Responsibilities
- Design, develop, and execute comprehensive test plans, test cases, and test scripts (functional, performance, stress, and regression) for AMD's RCCL (open-source, GPU-accelerated collective communication middleware) and related technologies.
- Validate networking features for multi-GPU and multi-node communication libraries, focusing on reliability, throughput, and latency.
- Establish and maintain automated test frameworks in languages such as Python to support continuous integration and enforce quality gates.
- Benchmark and profile the libraries on single-GPU, multi-GPU, and clustered systems to verify performance optimizations and identify regressions.
- Isolate, report, and track defects with clear, detailed, and reproducible steps, collaborating closely with development engineers to expedite resolution.
- Deploy the libraries on large clusters and participate in debugging complex, system-level issues that span multiple layers of the software stack, such as GPU kernel drivers and NIC drivers.
- Contribute to high-quality test documentation and participate in reviews of design and architectural specifications to ensure testability.
Other
- Accustomed to working in a dynamic, geographically distributed Agile team, where partnership and collaboration are paramount.
- Excellent written and verbal communication skills, meticulous attention to detail, and the ability to present your work clearly and cohesively.
- Results-oriented and accustomed to tight deadlines and changing priorities.
- Constantly thinking of ways to break software in order to ensure optimal performance and defect-free execution across various hardware configurations.
- B.Sc. or B.Eng. degree in Computer Science, Software Engineering, Electrical Engineering, or equivalent.