Splunk is hiring a Manager, Site Reliability Engineering, to lead engineers who operate highly available, scalable, and cost-efficient applications with low operational burden. The team maintains and improves the reliability and resiliency of services and infrastructure for Splunk's cloud-native systems in FedRAMP environments.
Requirements
- 8+ years of experience in handling large-scale cloud-native microservices platforms.
- 2+ years of hands-on experience managing teams that deploy, operate, and monitor large-scale Kubernetes clusters in the public cloud, specifically AWS or GCP.
- Experience with infrastructure automation and scripting in Python and/or Golang, including leading a team in this work.
- Strong hands-on experience with monitoring tools such as Splunk, Prometheus, Grafana, and the ELK stack to build observability for large-scale microservices deployments.
- Experience with deployment, operations, and performance management of one or more large-scale clusters such as Cassandra, Kafka, Elasticsearch, MongoDB, ZooKeeper, or Redis.
- Excellent problem-solving, triaging, and debugging skills in large-scale distributed systems.
- Experience with Infrastructure-as-Code using Terraform, CloudFormation, Google Deployment Manager, Pulumi, Packer, ARM, etc.
Responsibilities
- Manage a team working on reliability projects, including:
  - High availability (HA), business continuity planning, disaster recovery, backup/restore, RTO, and RPO
  - Chaos engineering
  - Application uptime and performance
  - Capacity management and planning
  - SLIs, SLOs, error budgets, and monitoring dashboards
- Own the deployment and operation of large-scale distributed data stores and streaming services.
- Establish design patterns for monitoring and benchmarking.
Other
- Lead a team of super smart engineers who are passionate about large-scale distributed systems for Splunk Cloud Observability in FedRAMP environments.
- Manage across the organization to deliver quality products that delight Splunk's passionate users.
- Mentor and grow teams of tight-knit engineers who are building a state-of-the-art, cloud-based environment for massive-scale data processing.
- Partner with our Talent Acquisition team as we recruit, interview and hire the best engineering talent to join Splunk's growing SRE FedRAMP team!
- Manage engineers to achieve more than they thought possible. You enjoy managing and driving teams to success and are fulfilled through the success of others.