Google's Site Reliability Engineering (SRE) team needs to ensure the reliability, uptime, and performance of large-scale, massively distributed, fault-tolerant systems and services, both internal and external, while driving continuous improvement and managing capacity.
Requirements
- 8 years of experience with data structures or algorithms.
- 5 years of experience with software development in one or more programming languages.
- 3 years of experience designing, analyzing, and troubleshooting distributed systems.
Responsibilities
- Design, write and deliver software to improve the availability, scalability, latency and efficiency of Google's services.
- Build and run large-scale, massively distributed, fault-tolerant systems.
- Ensure that Google's services have reliability and uptime appropriate to users' needs, and a fast rate of improvement.
- Keep an ever-watchful eye on our systems' capacity and performance.
- Optimize existing systems, build infrastructure and eliminate work through automation.
- Manage the complex challenges of scale that are unique to Google, using your expertise in coding, algorithms, complexity analysis and large-scale system design.
- Lead a team of Software/Systems Engineers on projects for users and be directly responsible for uptime.
- Own end-to-end availability and performance of key services and build automation to prevent problem recurrence.
Other
- Bachelor’s degree in Computer Science, a related field, or equivalent practical experience.
- 3 years of experience managing people or teams, or leading projects.
- Master's degree in Computer Science or Engineering.
- 1 year of people management experience.
- Lead a team and be responsible for products globally, providing technical leadership to key projects and empowering and developing teams to do the same.