Palo Alto Networks runs a large hybrid infrastructure and is one of the largest GCP customers. As a Site Reliability Engineer, you will be part of a team supporting the services running on this infrastructure. This includes automation, architecture, performance, metrics, troubleshooting, security, and reliability.
Requirements
- Expertise in configuration management with tools such as Ansible, Terraform, Helm, or Kubernetes
- Proficient in Python and/or Go
- Expertise in managing applications in Kubernetes clusters with autoscaling enabled
- Expertise in public cloud platforms (GCP or AWS), especially GCP
- Strong skills in Linux administration, Linux internals, and network troubleshooting
- Proficiency in automating tasks with Python, Go, and shell scripting
- Experience with CI/CD pipelines, GitLab, and GitHub preferred
Responsibilities
- Design, build, and operate reliable, secure cloud infrastructure
- Ensure that applications are production-ready, scalable, and reliable
- Develop tools and automation frameworks
- Automate the reliable deployment of robust services
- Orchestrate end-to-end monitoring and alerting
- Lead root cause analysis of critical business and production issues
- Participate in design reviews
Other
- Contribute to the success of SRE and DevOps
- Develop expertise in new technologies
- Work with developers, researchers, data scientists, and security experts
- Participate in the on-call rotation with the SRE and Dev teams
- Mentor and champion SRE culture