Software Delivery - Site Reliability Engineer

Apple

$134,800 - $245,800

Sep 12, 2025

Cupertino, CA, US

Apple is looking to build the next generation of release technologies that power its development lifecycle and shape how it delivers software to millions of customers.

Requirements

Experience as a Site Reliability Engineer, DevOps Engineer, or Software Engineer focused on infrastructure in a large-scale distributed environment.
Strong software development skills in a language like Swift, Go, or Python, and a high degree of comfort with shell scripting (Bash).
Hands-on experience building and managing systems with containers and container orchestration tools (Docker, Kubernetes).
Deep understanding of networking (TCP/IP, DNS, HTTP) and experience using observability tools (monitoring, logging, tracing) to diagnose complex issues.
Proven experience leading initiatives to reduce technical debt, refactor systems, or improve performance and latency.
Expertise in performance analysis and capacity planning for global, distributed systems.
Experience with large-scale distributed databases (e.g., Cassandra, FoundationDB) or messaging systems (e.g., Kafka).

Responsibilities

Design, build, and maintain robust, scalable, and observable systems for our core software delivery services.
Reduce operational toil by developing automation and tooling to prevent and rapidly resolve production issues.
Own and refine our incident management processes to ensure high availability.
Partner with development teams to create elegant, high-quality solutions that support the entire workflow, from source code to customer release.
Proactively identify and eliminate technical debt to improve long-term reliability and maintainability.

Other

Above all, a deep commitment to building reliable systems and to collaborating closely with team members across time zones.
Excellent problem-solving and communication skills, with a strong sense of ownership and drive.
Demonstrated ability to lead incident response for high-impact outages.
Familiarity with using Generative AI (GenAI) or Large Language Models (LLMs) to accelerate operational tasks, such as automating runbooks, generating scripts, or analyzing incident data.