Google's Site Reliability Engineering (SRE) team needs to ensure the reliability, uptime, and performance of large-scale, massively distributed, fault-tolerant systems. This involves optimizing existing systems, building infrastructure, eliminating work through automation, and managing the complex challenges of scale unique to Google.
Requirements
- 8 years of experience with software development in one or more programming languages.
- 3 years of experience designing, analyzing, and troubleshooting distributed systems.
Responsibilities
- Design, write and deliver software to improve the availability, scalability, latency and efficiency of Google's services.
- Own end-to-end availability and performance of key services and build automation to prevent problem recurrence.
- Automate response to all non-exceptional service conditions.
- Manage on-call rotations across continents, using a follow-the-sun model.
- Lead a team of Software/Systems Engineers on projects for users and be directly responsible for uptime.
- Ensure that Google's services have reliability and uptime appropriate to users' needs, and a fast rate of improvement.
- Keep an ever-watchful eye on our systems' capacity and performance.
Other
- 3 years of experience managing people or teams.
- 3 years of experience leading projects.
- Lead by example, mentor the team and establish credibility through quality technical execution.
- Google is proud to be an equal opportunity workplace and is an affirmative action employer.