Apple is looking to build the next generation of release technologies that power its development lifecycle and shape how it delivers software to millions of customers.
Requirements
- Experience as a Site Reliability Engineer, DevOps Engineer, or Software Engineer focused on infrastructure in a large-scale distributed environment.
- Strong software development skills in a language like Swift, Go, or Python, and a high degree of comfort with shell scripting (Bash).
- Hands-on experience building and managing systems with containers and container orchestration tools (Docker, Kubernetes).
- Deep understanding of networking (TCP/IP, DNS, HTTP) and experience using observability tools (monitoring, logging, tracing) to diagnose complex issues.
- Proven experience leading initiatives to reduce technical debt, refactor systems, or improve performance and latency.
- Expertise in performance analysis and capacity planning for global, distributed systems.
- Experience with large-scale distributed databases (e.g., Cassandra, FoundationDB) or messaging systems (e.g., Kafka).
Responsibilities
- Design, build, and maintain robust, scalable, and observable systems for our core software delivery services.
- Reduce operational toil by developing automation and tooling to prevent and rapidly resolve production issues.
- Own and refine our incident management processes to ensure high availability.
- Partner with development teams to create elegant, high-quality solutions that support the entire workflow, from source code to customer release.
- Proactively identify and eliminate technical debt to improve long-term reliability and maintainability.
Other
- Above all, a deep commitment to building reliable systems and collaborating closely with team members across time zones.
- Excellent problem-solving and communication skills, with a strong sense of ownership and drive.
- Demonstrated ability to lead incident response for high-impact outages.
- Familiarity with using Generative AI (GenAI) or Large Language Models (LLMs) to accelerate operational tasks, such as automating runbooks, generating scripts, or analyzing incident data.