Ensure services are reliable, scalable, and efficient by using AI tools to improve operational practices and by building self-service tools that empower engineering teams.
Requirements
- Proven experience as a Site Reliability Engineer, DevOps Engineer, or in a similar software engineering role.
- Strong proficiency in a programming language such as Python or Go.
- Experience building self-service tools (e.g., internal web portals, Slack integrations) to improve developer productivity and reduce operational toil.
- Deep understanding of the principles of SLIs, SLOs, and SLAs and experience implementing them.
- Hands-on experience with incident management protocols and on-call rotations.
- Familiarity with using AI/ML tools in an operational context for tasks like log analysis, anomaly detection, or automated remediation.
- Proficiency with cloud platforms (GCP, AWS, or Azure) and container technologies (Docker, Kubernetes).
Responsibilities
- Design, build, and maintain automation solutions to handle everything from provisioning and deployment to failure detection and remediation.
- Build and maintain user-friendly self-service tooling, including internal web portals, Slack bots, and automated Jira workflows, to streamline developer and operational tasks.
- Establish and manage Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to meet or exceed our Service Level Agreements (SLAs).
- Act as a key leader during production incidents, driving resolution and conducting blameless postmortems to prevent recurrence.
- Use AI-powered tools for advanced observability, anomaly detection, predictive alerting, and automation of complex operational tasks to enhance system reliability.
- Collaborate with engineering teams to design and implement scalable, highly available, and secure infrastructure.
Other
- We are looking for a proactive and innovative Site Reliability Engineer (SRE) to join our growing team.
- If you are passionate about building resilient systems and automating everything, we want to hear from you!
- A strong problem-solving mindset and a passion for continuous improvement and learning.