The Software Engineering Site Reliability Engineer (SRE) is responsible for ensuring the reliability, scalability, and performance of software systems.
Requirements
- 7+ years of hands-on SRE experience (software development, systems monitoring) with software development experience in Java, Go, and Python.
- Experience building and operating high-availability, fault-tolerant, scalable, distributed software in production: building monitoring, defining alerts, writing runbooks, establishing dashboards, etc.
- Experience with monitoring and log aggregation frameworks, such as Azure Monitor/Sentinel, Datadog, Dynatrace, Elasticsearch, Kibana, Logstash.
- Experience owning and maintaining software, including the SDLC and deployment.
- Strong working knowledge of Docker, Kubernetes, Terraform, and Chef or Ansible.
- Experience troubleshooting production applications and driving mitigation and remediation.
Responsibilities
- System Monitoring and Troubleshooting: Monitoring the performance and availability of software systems, identifying and resolving issues, and implementing proactive measures to prevent future incidents.
- Automation and Infrastructure: Developing and maintaining automation tools and infrastructure to streamline software deployment, configuration management, and system monitoring.
- Performance Optimization: Analyzing system performance, identifying bottlenecks, and implementing optimizations to improve the efficiency and scalability of software systems.
- Incident Response and Root Cause Analysis: Responding to incidents, conducting root cause analysis, and implementing corrective actions to prevent similar incidents in the future.
- Collaboration with Development Teams: Collaborating with software development teams to ensure that reliability and scalability considerations are incorporated into the software design and implementation.
- Continuous Improvement: Identifying opportunities for process improvement, implementing best practices, and driving initiatives to enhance the reliability and performance of software systems.
- Developing Systems for Internal Developers: Identifying areas of the Software Development Lifecycle that can be improved to reduce cognitive overhead for developers, and helping them along the happy path towards developing sustainable, reliable, and resilient software using industry-standard practices.
Other
- Hybrid: This role is categorized as hybrid. This means the successful candidate is expected to report to the innovation center in Mountain View, CA; Austin, TX; or Atlanta, GA three times per week.
- BS/MS in Computer Science/Engineering preferred.
- A company vehicle will be provided for this role upon successful completion of a Motor Vehicle Report review.
- This job may be eligible for relocation benefits.