Staff Software Engineer, Site Reliability (SRE)

Optimal Dynamics

$160,000 - $200,000
Aug 28, 2025
Remote, US

Optimal Dynamics is looking for a Staff Software Engineer, Site Reliability (SRE) to lead reliability across its production platform, ensure high availability, and drive smarter, data-driven operations at scale.

Requirements

  • Deep, hands-on experience with infrastructure at scale, including cloud and containerization:
      • AWS (multi‑service)
      • Containerized workloads on ECS and/or Kubernetes
      • CI/CD and IaC (Terraform)
      • Production networking fundamentals
  • Proficient in Python: you can read and review service code and land operational improvements.
  • Data-driven: you apply data to SLOs, capacity, performance, and cost efficiency, backed by strong observability skills.

Responsibilities

  • Own the company‑wide incident lifecycle: standards for detection, escalation, incident command, customer comms, and high‑quality postmortems with action tracking.
  • Define and drive SLIs/SLOs for core services; build guardrails and dashboards that make reliability visible and actionable.
  • Lead production readiness reviews, capacity/performance planning, load testing, disaster recovery exercises, and resilience engineering (failure testing/chaos where appropriate).
  • Level up on‑call: right‑sizing rotations, paging hygiene, runbooks, auto‑remediation, and continuous improvement of MTTA/MTTR.
  • Embed security into the delivery pipeline: dependency and image scanning, least‑privilege/IAM baselines, secrets management, and service‑to‑service auth.
  • Partner with Engineering leadership to maintain SOC 2‑aligned controls as code; make audit‑friendly evidence generation part of everyday engineering.
  • Build and evolve paved roads for deploys, config, and runtime operations in our monorepo (Bazel) and CI/CD (AWS CodePipeline/CodeBuild).

Other

  • Staff‑level IC who has led reliability programs at meaningful scale and owned incident response standards.
  • Influential: able to shape direction and create simple, durable standards.
  • Communicative: excels in both technical and interpersonal communication, with strong written and verbal skills.
  • Aware of FinOps (cost attribution, efficient scaling), with experience running DR/BCP programs.
  • Familiar with secure SDLC, threat modeling, and compliance automation in a SOC 2 context.