The company is seeking a Principal Observability Architect to lead the strategic architecture, evolution, and operationalization of a modern, multi-tenant Observability Platform-as-a-Service (OPaaS) tailored for a hybrid on-prem and cloud-native SaaS product.
Requirements
- Deep expertise in OpenTelemetry, including Collector deployment, semantic conventions, and sampling strategies.
- Experience integrating observability in Kubernetes, microservices, and serverless ecosystems.
- Hands-on experience with telemetry data pipelines using Cribl, Prometheus/VictoriaMetrics, and log/trace platforms.
- Experience embedding telemetry validation in CI/CD workflows.
- Familiarity with AI/ML for observability (anomaly detection, summarization, impact correlation).
- Working knowledge of data privacy, retention, and compliance practices in observability.
- Experience using AI, or thinking critically about how to integrate it, in work processes, decision-making, or problem-solving.
Responsibilities
- Lead architecture and roadmap for a multi-region, multi-cloud, multi-tenant observability platform scalable across diverse customer environments and service boundaries.
- Architect near real-time telemetry ingestion pipelines with end-to-end latency guarantees on the order of seconds, using a mix of streaming and batch processing technologies.
- Define observability blueprints including telemetry SLAs, data contracts, tenant data isolation, and cost-aware retention strategies for high-cardinality data.
- Ensure observability systems are cloud-native and container-aware, supporting environments built on Kubernetes, service meshes, and serverless components.
- Design and implement real-time metric, log, trace, and event pipelines with technologies such as VictoriaMetrics, Prometheus, Grafana, and Alertmanager; Cribl Stream and Edge for dynamic routing and filtering; and VictoriaLogs for structured log analysis.
- Embed real-time anomaly detection and signal correlation, with context-aware alerting to reduce alert noise and MTTR.
- Standardize OpenTelemetry instrumentation across all services with prebuilt SDKs, language libraries, and semantic conventions.
Other
- 10+ years in DevOps, SRE, or Observability roles, including 5+ years in architecture or platform engineering.
- Proven experience designing and operating near real-time observability systems in global-scale SaaS environments.
- Lead cross-functional collaboration with SRE, Platform, Security, and Engineering teams to evolve observability maturity.
- Define and document observability patterns, anti-patterns, and escalation workflows.
- Drive internal R&D around OpenTelemetry, AI in observability, high-cardinality telemetry, and eBPF-based observability tooling.