Hexaware is seeking an experienced Site Reliability Engineering (SRE) Lead to drive reliability, scalability, and observability across its services.
Requirements
- Proven experience in SRE/DevOps roles with responsibility for production reliability and observability
- Prior experience leading or mentoring engineering teams
- Strong Python experience, particularly for server-side code, automation, and operational tooling
- Hands-on expertise with Datadog: metrics, APM/tracing, logs, synthetics, dashboards, and alerting
- Deep understanding of observability concepts and best practices (SLIs/SLOs, tracing, contextual logging)
- Solid experience with container platforms and orchestration (Docker, Kubernetes)
- Experience with CI/CD systems and pipelines (e.g., GitHub Actions, Jenkins, CircleCI, GitLab CI)
Responsibilities
- Lead the SRE function: set technical direction, define best practices, and coach engineers on reliability and operational excellence
- Establish and maintain SLOs/SLIs, alerting policies, and error budgets in partnership with product and engineering teams
- Design, implement, and improve observability: metrics, traces, logs, dashboards, and runbooks (Datadog as primary tool)
- Automate operations to reduce toil: CI/CD pipelines, automated rollouts, self-healing mechanisms, and runbook automation
- Own incident management: lead incident response, coordinate cross-team communications, and drive blameless postmortems and remediation
- Drive capacity planning, performance tuning, and disaster recovery planning for Python server applications and services
- Manage tooling and infrastructure: container orchestration, infrastructure-as-code, secrets management, and monitoring integrations
Other
- Degree in Computer Science or Engineering, or equivalent practical experience
- Typically 5+ years in SRE/DevOps roles and 2+ years in a lead or senior position (flexible for exceptional candidates)
- Excellent communication skills and the ability to influence cross-functional teams
- Ability to work in a hybrid environment (2-3 days onsite per week)
- Hexaware Technologies is an equal opportunity employer