SK hynix America seeks to develop and operate high-performance computing clusters that support AI/ML workloads, ensuring the scalability, performance, reliability, and cost-effectiveness of its AI data center IT environments.
Requirements
- 2+ years of experience in AI cluster engineering, MLOps, and benchmark testing, including GPU performance analysis, memory usage, and energy/power monitoring tools.
- Strong familiarity with AI computing architectures, AI/ML infrastructure requirements, memory architecture and usage in AI/ML, and AI algorithm trends and best practices.
- Expertise in optimizing resource utilization, improving system throughput, and reducing latency in both training and inference.
Responsibilities
- Design and implement distributed computing cluster infrastructure to support large-scale AI/ML model training and inference jobs with a focus on transformer-based AI models.
- Build and maintain distributed systems to ensure scalability, efficient resource allocation, and high throughput.
- Optimize cluster performance through hardware selection, equipment configuration, network engineering, and performance analysis.
- Deploy and operate data center networking infrastructure, using software systems for automation, design validation, deployment, and operational support.
- Implement tools and processes to maintain high uptime and ensure infrastructure reliability during both model training and inference phases.
- Identify and resolve performance bottlenecks, improving overall system throughput and response times.
- Collaborate with cross-functional teams, including research, security, and benchmark test engineering teams, to integrate infrastructure with AI workflows, ensuring seamless deployment and operation.
Other
- Master’s degree or above in Computer Science, Electrical Engineering, or a related field.
- Engage with technology vendors and partners to evaluate new solutions and drive innovation in AI computing infrastructure.
- Work Model: Onsite
- Office Location: San Jose, CA