Apple is looking to solve large-scale ML training challenges by enhancing distributed cloud training techniques for foundation models and operationalizing large-scale ML workloads on Kubernetes.
Requirements
- 1+ years of hands-on experience in building scalable backend systems for training and evaluation of machine learning models
- Proficient in relevant programming languages, such as Python or Go
- Strong expertise in distributed systems, reliability and scalability, containerization, and cloud platforms
- Proficient in cloud computing infrastructure and tools: Kubernetes, Ray, PySpark
- Proficient in working with and debugging accelerators, such as GPUs, TPUs, and AWS Trainium
- Proficient in ML training and deployment frameworks, such as JAX, TensorFlow, PyTorch, TensorRT, and vLLM
- Ability to clearly and concisely communicate technical and architectural problems, while working with partners to iteratively find solutions
Responsibilities
- Drive large-scale training initiatives to support our most complex models.
- Operationalize large-scale ML workloads on Kubernetes.
- Enhance distributed cloud training techniques for foundation models.
- Design and integrate end-to-end lifecycles for distributed ML systems.
- Develop tools and services to optimize ML systems beyond model selection.
- Architect a robust MLOps platform to support seamless ML operations.
- Collaborate with cross-functional engineers to solve large-scale ML training challenges.
Other
- Bachelor's degree in Computer Science, engineering, or a related field
- Advanced degree in Computer Science, engineering, or a related field
- Lead complex technical projects, defining requirements and tracking progress with team members.
- Mentor engineers in areas of your expertise, fostering skill growth and knowledge sharing.
- Cultivate a team centered on collaboration, technical excellence, and innovation.