The company carries a critical operational burden: manual, hands-on work that is essential to its customers' success. The role is to systematically dismantle that burden through automation, tooling, and systems.
Requirements
- strong background in Site Reliability Engineering or a closely related DevOps function
- strong command of Linux systems administration
- understanding of networking fundamentals (TCP/IP, DNS, routing)
- production experience with a major cloud provider, preferably AWS, including proficiency in its core concepts and services (VPC, EC2, IAM, S3)
- experience building and managing infrastructure as code with tools like Terraform
- hands-on experience instrumenting applications and managing the telemetry pipelines for metrics, logs, and traces
Responsibilities
- manual deployments
- reactive troubleshooting
- on-call escalations
- write code, build tooling, and deploy declarative infrastructure to eliminate the most critical operational burdens
- act as a primary stakeholder, providing clear requirements to our internal tooling and platform teams and ensuring their solutions meet the operational need
- define and implement Service Level Objectives (SLOs)
- influence the design of new products for operability
Other
- experience working directly with external customers to solve difficult technical problems
- clear, empathetic, and precise communication
- We value compassion
- We value humility
- We value reliability