Developing and operating a large-scale software platform requires observability solutions that provide insight into its health and performance.
Requirements
- 7+ years of production-level experience in at least one of Go, Python, Java, Scala, Rust, C++, or a similar language
- Experience developing software for large-scale distributed systems
- Experience with cloud technologies, e.g. AWS, Azure, GCP, Docker, or Kubernetes
- Familiarity with observability infrastructure, monitoring patterns, and reliability practices
Responsibilities
- Building the next generation of observability platforms that support billions of active time series and process petabytes of logs daily
- Managing infrastructure across nearly a hundred cloud regions
- Developing advanced workflows that accelerate incident diagnosis
- Upleveling monitoring and reliability practices across Databricks engineering
- Developing opinionated tools that set common standards for managing structured logs, metrics, alerts, dashboards, and oncall rotations
Other
- BS (or higher) in Computer Science or a related field
- Experience driving large projects involving multiple teams
- Experience mentoring and upleveling engineers, fostering a culture of technical excellence within the team and the broader observability community