The Apple Services Engineering (ASE) team builds and provides the platforms, services, and infrastructure that power Apple's services (such as iCloud, iTunes, Siri, and Maps) and ensures that these services scale globally, remain highly available, and operate flawlessly for hundreds of millions of users.
Requirements
- Understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, capacity planning, automation, and toil reduction.
- Proficiency in at least one programming language: Python, Go, or Java.
- Experience managing and scaling distributed systems in a public, private, or hybrid cloud environment.
- Experience with microservices architecture and container orchestration using Kubernetes or similar technologies.
- Proficiency in both backend coding technologies (Python, Go, and Java) and frontend coding technologies (JavaScript and its variants).
- Strong understanding of Linux operating system fundamentals, networking principles, and system management.
Responsibilities
- Operate, monitor, and triage all aspects of our production and non-production environments.
- Pioneer and implement the next-generation telemetry system.
- Prepare alert-handling procedures and runbooks, and collaborate with the offshore SRE teams.
- Automate the deployment and orchestration of services in the cloud environment, as well as other routine processes.
- Actively participate in capacity planning, scale testing, and disaster recovery exercises.
- Interact with and support partner teams, including engineering, QA, and program management.
- Cultivate and maintain relationships with internal partners and external third-party vendors.
Other
- 6+ years of demonstrated expertise in a Site Reliability Engineering, Infrastructure Ops, or DevOps-focused role.
- BS or MS in Computer Science or a related field, or equivalent work experience.
- Experience running Tier 1 services with 24/7 support.
- Strong sense of ownership, with a desire to communicate and collaborate with other engineers and teams.