Apple's AI/ML team is looking to build groundbreaking technology and needs a senior engineering program manager for its ML Training platform to provide efficient and scalable compute and processing for the machine learning lifecycle.
Requirements
- 6+ years of product and/or technical program management experience covering as many of the following areas as possible: data center infrastructure, on-prem compute platforms, GPU/TPU or custom accelerator evaluation, price-performance modeling, capacity planning and integration, container stack, and networking.
- Experience with supply/demand and finance operations management related to ML compute accelerators.
- Proven track record driving programs that involve evaluating and onboarding new accelerator technologies and integrating them into large-scale ML training platforms
- Strong analytical skills in conducting cost-performance trade-off analysis and ensuring hardware/software optimization for ML workloads
- Proficiency in multitasking and leading sophisticated programs, with a track record of bringing highly technical infrastructure to production
- BS in EE, CS, Systems Engineering or equivalent work experience
Responsibilities
- As a key EPM on the AI/ML team, you will establish cross-functional partnerships with ML stakeholders, understand their use cases, and improve the ease of use of the compute services.
- You will lead supply/demand management along with finance operations for ML Compute accelerators.
- You will collaborate with Apple’s AI/ML engineering teams to define a partnership strategy across the entire ML ecosystem, including third-party public cloud, Apple’s internal cloud, silicon vendors, and OSS providers
- You will lead technical evaluations of emerging accelerator technologies (GPUs, TPUs, and custom silicon), balancing power, performance, cost, and compatibility to ensure infrastructure decisions are forward-looking and cost-efficient
- Own cross-functional execution of capacity integration efforts, aligning hardware procurement, rack deployment, networking, and software readiness to ensure seamless delivery at scale
- Translate ML platform and model training requirements into concrete infrastructure programs, ensuring technical risks are anticipated and mitigated early
Other
- Self-motivated, independent, and proactive; demonstrated creative and critical thinking capabilities; can triage, prioritize, and lead cross-functional teams in real time under pressure
- Outstanding interpersonal skills, including the ability to influence hardware vendors, public cloud providers, and internal engineering teams
- Strong desire to learn, aptitude for problem solving, and the ability to make sophisticated trade-offs
- MBA/MS in EE, CS, Systems Engineering or equivalent (Preferred Qualification)