Optimal Dynamics is looking for someone to lead reliability across its production platform, ensuring high availability and driving smarter, data-driven operations at scale.
Requirements
- Deep, hands-on experience with infrastructure at scale, cloud, containerization, and more:
  - AWS (multi‑service)
  - Containerized workloads on ECS and/or Kubernetes
  - CI/CD & IaC (Terraform)
  - Production networking fundamentals
- Python Proficient: You can read/review service code and land operational improvements.
- Data Driven: You take a data-driven approach to SLOs, capacity, performance, and cost efficiency, with strong observability chops.
Responsibilities
- Own the company‑wide incident lifecycle: standards for detection, escalation, incident command, customer comms, and high‑quality postmortems with action tracking.
- Define and drive SLIs/SLOs for core services; build guardrails and dashboards that make reliability visible and actionable.
- Lead production readiness reviews, capacity/performance planning, load testing, disaster recovery exercises, and resilience engineering (failure testing/chaos where appropriate).
- Level‑up on‑call: right‑sizing rotations, paging hygiene, runbooks, auto‑remediation, and continuous improvement of MTTA/MTTR.
- Embed security into the delivery pipeline: dependency and image scanning, least‑privilege/IAM baselines, secrets management, and service‑to‑service auth.
- Partner with Engineering leadership to maintain SOC 2‑aligned controls as code; make audit‑friendly evidence generation part of everyday engineering.
- Build and evolve paved roads for deploys, config, and runtime operations in our monorepo (Bazel) and CI/CD (AWS CodePipeline/CodeBuild).
Other
- Staff‑level IC who has led reliability programs at meaningful scale and owned incident response standards.
- Influential: Able to shape direction and create simple, durable standards.
- Communicative: Excels in both technical and interpersonal communication, with strong written and verbal skills.
- Aware of FinOps (cost attribution, efficient scaling) and experienced with DR/BCP programs.
- Familiar with secure SDLC, threat modeling, and compliance automation in a SOC 2 context.