Support the reliability, performance, and availability of critical applications and infrastructure
Requirements
- Hands-on experience with at least one observability/monitoring tool from the following: Dynatrace, Catchpoint, ELK, OpenTelemetry, Splunk, AppDynamics, Datadog, SolarWinds, AWS CloudTrail, Grafana, or LightStep
- 9+ years of experience as a Go and Python developer
- Strong understanding of logs, metrics, traces, and distributed systems monitoring
- Experience with cloud platforms (AWS, Azure, or GCP) and cloud-native observability practices
- Familiarity with CI/CD pipelines, automation, and Infrastructure-as-Code (IaC)
Responsibilities
- Implement, configure, and maintain observability tools to monitor application performance, infrastructure health, and system availability
- Collect, analyze, and visualize metrics, logs, and traces to identify issues, trends, and opportunities for optimization
- Collaborate with engineering, DevOps, and operations teams to define monitoring strategies and ensure proactive detection of incidents
- Develop dashboards, alerts, and reports to provide real-time visibility into system performance and user experience
- Troubleshoot and resolve performance bottlenecks, application errors, and infrastructure issues using observability insights
- Ensure integration of observability solutions with CI/CD pipelines and cloud-native environments
Other
- Prior experience working at Capital One or Discover is required
- W2 contract only; applications on a corp-to-corp (C2C) basis will not be considered for this role
- Excellent problem-solving skills and ability to perform root cause analysis in complex environments
- Strong collaboration and communication skills to work effectively across technical teams