The company needs to improve the quality and reliability of its services by providing clear visibility into system health, along with intelligence and actionable insights.
Requirements
- Deep knowledge of monitoring, alerting, and logging systems and tools, such as Prometheus, Grafana, Elastic Stack, Datadog, or New Relic.
- Familiarity with distributed tracing technologies, such as Jaeger or Zipkin.
- Experience with cloud-based infrastructure, including AWS, Azure, or Google Cloud Platform.
- Strong understanding of DevOps and SRE practices, including continuous integration, continuous delivery, and infrastructure as code (IaC).
- Proficiency in scripting languages, such as Python, Bash, or Ruby.
- Experience with containerization and orchestration technologies, such as Docker and Kubernetes.
- Familiarity with application performance management (APM) tools, such as Dynatrace or AppDynamics.
Responsibilities
- Design, implement, and maintain observability solutions such as monitoring, alerting, logging, and tracing across various platforms, applications, and infrastructure.
- Collaborate with cross-functional teams to identify and define observability requirements.
- Develop and implement best practices for creating and maintaining effective monitoring, alerting, and telemetry systems.
- Evaluate and recommend industry-leading observability tools and technologies to improve system visibility and reliability.
- Define and track key performance indicators (KPIs) and service-level objectives (SLOs) related to system availability, performance, and reliability.
- Assist in the troubleshooting and resolution of complex incidents and problems by analyzing data from observability tools.
- Conduct ongoing evaluations of observability systems and identify opportunities for improvements and optimizations.
Other
- Bachelor's Degree in Computer Science, Engineering, or a related technical field.
- Excellent communication and collaboration skills, with the ability to work with teams across different functions and technical domains.
- Strong problem-solving and analytical skills, with a focus on data-driven decision-making.
- A proven track record of leading and delivering successful observability projects and initiatives.
Benefits
- Medical, Dental, Vision, Life, and AD&D insurance
- Flexible Spending Accounts (FSA) & Health Savings Account (HSA)
- Long-term/Short-term Disability
- Employee Assistance Program (EAP)
- 401(k) Plan with Company Match
- 18-21 days of Paid Time Off (PTO) per year, based on tenure
- 12 Public Holidays
- Paid Parental Leave
- Pre-tax commuter benefits
- MTV - [Free] Electric Car Charging Station