Gallagher is seeking a Manager of Observability Engineering to build and maintain robust monitoring systems. The role ensures the reliability, performance, and scalability of the company's infrastructure by developing observability practices across cloud-native and on-premises environments.
Requirements
- Hands-on experience with observability tools such as Nagios, Grafana, the ELK stack (Elasticsearch, Logstash, Kibana), Splunk, Dynatrace, SolarWinds, or equivalent platforms.
- Proficiency in scripting or programming languages like Python, Perl, or Bash for automation and tool development.
- Familiarity with cloud environments (AWS, Azure, Google Cloud) and containerization and orchestration tools (Docker, Kubernetes).
- Experience with infrastructure-as-code tools (Terraform, Ansible) and CI/CD pipeline management.
- Strong analytical and problem-solving capabilities with a proactive approach to troubleshooting and resolving issues.
- A solid understanding of security best practices, including network and system security monitoring, and knowledge of performance optimization techniques and tools.
- Proven ability to design effective alerting systems and manage clear escalation processes.
Responsibilities
- Design and Implement Observability Solutions: Develop and maintain comprehensive monitoring, logging, tracing, and alerting systems that cover both cloud-native and on-premises environments.
- Collaborate with Cross-Functional Teams: Work closely with development, DevOps, and SRE teams to integrate observability best practices throughout all phases of the software development and operations lifecycle.
- Create Effective Dashboards and Visualizations: Build and optimize intuitive dashboards that offer actionable insights into system performance, health, and key metrics, ensuring that teams have access to real-time data.
- Develop Automated Alerts and Incident Management Workflows: Implement automated alerting mechanisms and incident response processes designed to detect issues early and resolve them proactively, ultimately minimizing customer impact.
- Optimize Observability Platforms: Continually evaluate and refine observability tools and platforms to ensure they remain efficient, scalable, and user-friendly.
- Integrate with Various Observability Tools: Manage integrations with popular observability platforms—such as Dynatrace, Grafana, the ELK Stack, Splunk, Datadog, and New Relic—to create a cohesive, unified monitoring strategy.
- Lead Root Cause Analysis and Post-Incident Reviews: Facilitate in-depth investigations after incidents, conduct thorough root cause analyses, and recommend actionable improvements to prevent future recurrences.
Other
- Education: Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience.
- Experience: A minimum of 3 years in observability engineering or a similar role, demonstrating a strong grasp of observability concepts including monitoring, logging, tracing, and alerting.
- Communication & Collaboration: Excellent interpersonal skills with the ability to communicate effectively and collaborate with cross-functional teams.
- Location: This role will be based in Colombia.
- Advocate for Observability and Reliability: Champion a company-wide culture that prioritizes system observability, reliability, and performance, ensuring these principles are embedded in every project and initiative.