Seeking an experienced Observability and Monitoring Engineer to build and mature enterprise-wide monitoring, logging, alerting, and observability capabilities across an AWS-based technology stack
Requirements
- Hands-on experience with monitoring/logging tools such as Zabbix, Graylog, Splunk, SolarWinds, or equivalents
- 5+ years of hands-on experience with AWS services and architecture
- Deep understanding of metrics, logs, traces, distributed tracing, and event correlation
- Experience building dashboards and KPIs for application, infrastructure, and database layers
- Strong scripting/automation skills (Python, Bash, PowerShell) and familiarity with Terraform or CloudFormation
- AWS Services (EC2, RDS, S3, Lambda, ECS/EKS, etc.)
- Monitoring Tools (Dynatrace, CloudWatch, Zabbix, SolarWinds, Graylog, etc.)
Responsibilities
- You will establish standards for logs, metrics, traces, event correlation, and alerting across multiple environments
- You will build centralized dashboards and alerting policies that provide unified visibility across applications and services; operating systems; AWS services (EC2, RDS, Lambda, S3, CloudWatch, CloudTrail, etc.); databases (MS SQL Server, PostgreSQL, etc.); file transfer systems (SFTP, managed transfer tools); and batch jobs and scheduled processes
- You will create actionable and noise-free alerting thresholds, escalation policies, and runbooks
- You will integrate existing tools (Dynatrace, Graylog, Splunk, SolarWinds, Zabbix) into a cohesive ecosystem
- You will rationalize tool usage and recommend consolidation or modernization where appropriate
- You will manage the lifecycle, configuration, tuning, and health of monitoring and logging platforms
- You will automate monitoring deployments using IaC (CloudFormation) and CI/CD pipelines
- You will develop reusable templates and standards so teams can onboard new applications quickly
- You will define SLOs/SLIs and reliability KPIs for critical services
Other
- Bachelor's degree in Computer Science or related field
- 5+ years of experience implementing monitoring and observability using Dynatrace
- Familiarity with ITIL incident/problem management processes
- Proficiency with AI tools, and responsible use of them to improve observability, preferred
- Experience with container orchestration and microservices architecture preferred