SK hynix America is seeking an engineer to design, architect, and optimize large-scale GPU clusters for AI/ML training and inference workloads, driving the evolution of mobile technology, empowering cloud computing, and pioneering future technologies.
Requirements
- Proven experience designing and deploying large-scale AI/ML clusters in production environments, including clusters with 100+ GPUs.
- Direct involvement in hardware selection, network design, and performance optimization for AI workloads.
- Hands-on expertise with modern GPU architectures from NVIDIA or AMD, plus familiarity with emerging AI accelerator technologies.
- Comprehensive knowledge of AI/ML frameworks and their infrastructure requirements, including PyTorch and distributed training libraries such as DeepSpeed, Megatron-LM, and Ray.
- Understanding of how framework-specific optimizations impact cluster design decisions and how architectural choices affect model training efficiency and scalability.
- Strong background in high-performance networking, including designing low-latency, high-bandwidth network fabrics (e.g., InfiniBand, RoCE, or proprietary interconnects).
- Practical experience integrating cluster design decisions with facility requirements, including power density considerations based on GPU selection, cooling architecture for varying cluster sizes, and space optimization aligned with data center infrastructure.
Responsibilities
- Architect robust, scalable, and efficient computing clusters that maximize AI workload performance while meeting operational and budgetary constraints.
- Bridge hardware capabilities and AI/ML framework requirements, translating model training needs and inference performance targets into concrete system specifications.
- Design end-to-end cluster architectures that encompass compute resources, networking fabric, storage subsystems, and power/cooling integration.
- Select appropriate GPU platforms based on workload characteristics, and design network topologies that minimize communication bottlenecks in distributed training scenarios.
- Architect storage solutions that can sustain the high-throughput demands of large-scale AI operations.
- Conduct detailed performance modeling and capacity planning exercises, predicting cluster behavior under various workload scenarios and identifying potential bottlenecks before deployment.
- Guide decisions on cluster topology, including considerations for rail-optimized designs, spine-leaf architectures, and direct GPU-to-GPU connectivity technologies such as NVLink and InfiniBand configurations.
Other
- Bachelor’s degree in an engineering or science discipline, with coursework equivalent to a standard computer engineering curriculum.
- 8+ years of professional experience in systems architecture.
- Minimum 3 years dedicated to AI/ML infrastructure design and deployment.
- Ability to collaborate effectively with facility engineers to ensure clusters are operationally feasible.
- Track record of designing clusters supporting diverse workloads, from large language model training to high-performance computing and/or computer vision applications.