Palo Alto Networks runs a large hybrid infrastructure and is one of the largest Google Cloud Platform (GCP) customers. As a Site Reliability Engineer, you will be part of a team supporting the services running on this infrastructure. The role covers automation, architecture, performance, metrics, troubleshooting, security, and reliability.
Requirements
- Expertise in configuration management and infrastructure-as-code tooling such as Ansible, Terraform, Helm, or Kubernetes
- Proficient in Python and/or Go
- Expertise in managing applications on Kubernetes clusters with autoscaling enabled
- Experience in Production Engineering, DevOps, or Site Reliability
- Expertise in public cloud platforms (GCP or AWS), especially GCP
- Strong skills in Linux administration, Linux internals, and network troubleshooting
- Proficiency in automating tasks with Python, Go, and shell scripting
Responsibilities
- Design, build, and operate reliable, secure Cloud infrastructure
- Ensure that applications are production-ready, scalable, and reliable
- Develop tools and automation frameworks
- Automate the reliable deployment of robust services
- Orchestrate end-to-end monitoring and alerting
- Lead root cause analysis of critical business and production issues
- Participate in design reviews
Other
- BS or MS in Computer Science or a related field, or equivalent professional or military experience
- Experience with CI/CD pipelines, GitLab, and GitHub preferred
- Ability to diagnose and troubleshoot complex distributed systems handling high-volume transactions
- Excellent written and verbal communication skills, with the ability to collaborate and rally support
- Self-disciplined, self-managed, and self-motivated, with a strong sense of ownership, urgency, and drive
- Passion for infrastructure and monitoring as code
- Able to understand and dissect new technology stacks quickly