The company is looking to improve system reliability and performance through enhanced observability, proactive monitoring, and efficient incident response.
Requirements
- Minimum three (3) years of experience with Observability and Orchestration (New Relic preferred)
- Minimum three (3) years of experience with Configuration Management and Automation tools (Ansible and Terraform preferred)
- Minimum three (3) years of experience with Monitoring and Telemetry tools
- Harness and Gearset experience (Preferred)
- Three (3) years of JavaScript experience (Preferred)
- Three (3) years of CI/CD experience (Preferred)
Responsibilities
- Design, implement, and maintain observability platforms (logging, metrics, tracing, alerting) to ensure system reliability and performance.
- Develop and optimize dashboards, visualizations, and reports to provide actionable insights to engineering and operations teams.
- Configure and manage monitoring tools (e.g., Prometheus, Grafana, New Relic, Datadog, Elastic, Splunk) for real-time visibility into applications and infrastructure.
- Define and track key Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to measure system health and performance.
- Collaborate with developers, SREs, and platform teams to instrument applications and services for observability (e.g., distributed tracing, structured logging).
- Establish and maintain automated alerting and incident response workflows to reduce Mean Time to Recovery (MTTR).
- Automate observability infrastructure provisioning and configuration through Infrastructure as Code (IaC) tools (e.g., Terraform, Ansible, Helm).
Other
- Minimum three (3) years of progressive relevant industry experience.
- Ability to interact professionally with a variety of institutions.
- Excellent written and verbal communication skills.
- Ability to work independently and within a team.
- Desire to grow knowledge and skill set through on-the-job training, formal classroom training, and independent research.