Shape the future of how Apple delivers software to millions of customers
Requirements
- Experience as a Site Reliability Engineer, DevOps Engineer, or Software Engineer focused on infrastructure in a large-scale distributed environment
- Strong software development skills in a language like Swift, Go, or Python, and a high degree of comfort with shell scripting (Bash)
- Hands-on experience building and managing systems with containers and container orchestration tools (Docker, Kubernetes)
- Deep understanding of networking (TCP/IP, DNS, HTTP) and experience using observability tools (monitoring, logging, tracing) to diagnose complex issues
- Proven experience leading initiatives to reduce technical debt, refactor systems, or improve performance and latency
- Expertise in performance analysis and capacity planning for global, distributed systems
Responsibilities
- Ensure System Reliability: Design, build, and maintain robust, scalable, and observable systems for our core software delivery services
- Automate: Reduce operational toil by developing automation and tooling to prevent and rapidly resolve production issues
- Improve Incident Response: Own and refine our incident management processes to ensure high availability
- Collaborate with Engineers: Partner with development teams to create elegant, high-quality solutions that support the entire workflow, from source code to customer release
- Improve and Modernize Systems: Proactively identify and eliminate technical debt to improve long-term reliability and maintainability
Other
- Excellent problem-solving and communication skills, with a strong sense of ownership and drive
- Deep commitment to building reliable systems and to collaborating closely with team members across time zones