ClickHouse is looking for an experienced engineer to join their Observability team to build and operate the telemetry platform that powers both internal monitoring and the observability features their customers rely on. The systems ingest trillions of events per day, with sustained throughput in the tens of millions of events per second, so the platform must be reliable, scalable, and efficient.
Requirements
- 5+ years building and running production systems at scale
- Proficiency in Go
- Experience with Kubernetes, Helm, ArgoCD, and Terraform or similar IaC tools
- Comfortable working with at least one major cloud provider (AWS, GCP, Azure)
- Experience with OpenTelemetry, Prometheus, Grafana, or similar tools
- Experience with ClickHouse preferred
Responsibilities
- Design, build, and operate distributed systems that power observability across ClickHouse Cloud
- Own reliability, performance, and cost-efficiency of our telemetry pipeline and storage systems
- Take part in the on-call rotation and help drive root-cause resolution and long-term fixes
- Build tooling and automation to eliminate repetitive operational work
- Help shape the roadmap for observability by identifying bottlenecks and scaling challenges
- Collaborate with other engineering teams to improve their observability posture
- Contribute to design discussions and architecture reviews, and mentor teammates
Other
- Strong bias for action and ownership: you ship, fix, and improve systems proactively
- Great production debugging skills and a problem-solving mindset
- Strong communication skills; comfortable working in a remote, async-friendly team
- Experience balancing system performance, reliability, and cost
- Ability to iterate quickly: build MVPs, collect feedback, and improve continuously