The Apple Silicon GPU SW Architecture team is seeking a senior/principal engineer to lead server-side ML acceleration and multi-node distribution initiatives, helping to define and shape our future GPU compute infrastructure on Private Cloud Compute that enables Apple Intelligence.
Requirements
- Strong knowledge of GPU programming (CUDA, ROCm) and high-performance computing
- Excellent systems programming skills in C/C++; Python is a plus
- Deep understanding of distributed systems and parallel computing architectures
- Experience with inter-node communication technologies (InfiniBand, RDMA, NCCL) in the context of ML training/inference (see the sketch after this list)
- Understanding of how tensor frameworks (PyTorch, JAX, TensorFlow) are used in distributed training/inference
- Familiarity with the model development lifecycle, from trained model to large-scale production inference deployment
- Proven track record in ML infrastructure at scale
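
To illustrate the level of hands-on familiarity expected with multi-node collectives, here is a minimal sketch of a NCCL-backed all-reduce via torch.distributed. The tensor shape, file name, and launch details are illustrative assumptions, not details from this posting.

```python
# Minimal sketch: a NCCL-backed all-reduce across GPUs/nodes with
# torch.distributed. Tensor shape and script name are assumptions.
import os
import torch
import torch.distributed as dist

def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, and the rendezvous
    # variables in the environment before this script starts.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; NCCL sums them across all ranks.
    x = torch.ones(1024, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

A typical launch on two 8-GPU nodes would be `torchrun --nnodes=2 --nproc-per-node=8 allreduce_demo.py` (script name hypothetical); NCCL rides over RDMA/InfiniBand transparently when the fabric is available.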
Responsibilities
- Design and implement tensor/data/expert parallelism strategies for large language model inference across distributed server cluster environments (an illustrative sketch follows this list)
- Drive hardware and software roadmap decisions for ML acceleration
- Design architectures that achieve peak compute utilization and optimal memory throughput
- Develop and optimize distributed inference systems with a focus on latency, throughput, and resource efficiency across multiple nodes
- Architect scalable ML serving infrastructure supporting dynamic model sharding, load balancing, and fault tolerance
- Collaborate with hardware teams on next-generation accelerator requirements and software teams on framework integration
- Lead performance analysis and optimization of ML workloads, identifying bottlenecks in compute, memory, and network subsystems
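
As a concrete example of one parallelism strategy named above, the sketch below shards a linear layer column-wise across ranks (in the style of Megatron-LM's column-parallel linear). It is a minimal illustration under assumed shapes and a torch.distributed process group, not Apple's implementation.

```python
# Illustrative sketch of column-parallel sharding for inference: each rank
# holds a column shard of the weight, computes its output slice, and the
# slices are gathered. Shapes and initialization are assumptions.
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    """Each rank owns out_features // world_size output columns."""

    def __init__(self, in_features: int, out_features: int) -> None:
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        # Local shard of the full (out_features, in_features) weight.
        self.weight = torch.nn.Parameter(
            torch.empty(out_features // world_size, in_features, device="cuda")
        )
        torch.nn.init.xavier_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each rank computes its slice of the output features.
        local_out = x @ self.weight.t()
        # Gather every rank's slice and concatenate along the feature dim.
        shards = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(shards, local_out)
        return torch.cat(shards, dim=-1)
```

In practice the gather is often deferred (or replaced by an all-reduce after a subsequent row-parallel layer) to minimize communication on the critical latency path.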
Other
- Technical BS/MS degree
- This is a hands-on technical leadership position