Decagon delivers concierge customer experiences at scale through its conversational AI platform. The Infrastructure team builds and operates the foundational systems that power this platform, ensuring high performance, reliability, and scalability for AI agents and customer interactions.
Requirements
- 3+ years building and operating production infrastructure at scale.
- Depth in at least one area across Core/Data/AI‑ML/Platform/Voice, with curiosity to learn the rest.
- Proven track record meeting high availability and low latency targets (owning SLOs, p95/p99, and load testing).
- Excellent observability chops (OpenTelemetry, Prometheus/Grafana, Datadog) and incident response (PagerDuty, SLO/error budgets).
- Strong Kubernetes experience (GKE/EKS/AKS) and familiarity with multiple cloud providers (GCP, AWS, and Azure).
- Experience with customer‑managed deployments.
Responsibilities
- Design and implement critical infrastructure services with strong SLOs, clear runbooks, and actionable telemetry.
- Partner with research and product teams to architect solutions, set up prototypes, evaluate performance, and scale new features.
- Reduce service latencies: optimize networking paths, apply smart caching/queuing, and tune CPU/memory/I/O for tight p95/p99s.
- Evolve CI/CD, golden paths, and self‑service tooling to improve developer velocity and safety.
- Support a range of customer deployment architectures with robust observability and upgrade paths.
- Lead infrastructure‑as‑code (Terraform) and GitOps practices; reduce drift with reusable modules and policy‑as‑code.
- Participate in on‑call and drive down toil through automation and elimination of recurring issues.
Other
- Clear written communication and the ability to turn ambiguous requirements into simple, reliable designs.
- Prior experience as an early backend/platform/infrastructure engineer at another company.
- We are an in-office company.
- Customers are everything
- Relentless momentum