Counsel is seeking to scale its infrastructure to meet the demands of a growing user base and healthcare clients, ensuring high uptime and low latency for its AI-native medical platform.
Requirements
5+ years building production software, with 2+ years deploying production systems in a cloud environment.
Proven experience building infrastructure that handles millions of requests with strict uptime and latency requirements
Proficiency with Terraform or similar tools for automated infrastructure management
Proven experience with secure container deployment and Kubernetes best practices on AWS EKS
Proven experience with real-time queue systems (e.g. Kafka, SQS, Temporal)
Opinionated about improving developer experience through tooling, automation, and workflows that accelerate and up-level the engineering team around you
Responsibilities
Scale our infrastructure and deployments to meet the demands of hundreds of thousands of patients while maintaining fast response times and 99.99% uptime
Plan and lead key infrastructure investments, ensuring we can scale while balancing short-term delivery with long-term maintainability
Drive development of core data infrastructure, including our data transformation and warehouse systems, integrations with health information networks, and data ingestion pipelines with enterprise clients
Improve developer experience and engineering velocity, building the tooling, CI/CD pipelines, and observability systems that let teams move faster with confidence
Champion operational and security excellence, including reliability, monitoring, on-call practices, and infrastructure security aligned with healthcare standards (HIPAA, SOC2)
Other
This role is fully remote.
You consider yourself a Swiss Army knife engineer who can flex into infrastructure problems related to backend services, data pipelines, and web products.
You care deeply about performance and feel immense satisfaction driving down the latency of critical services
You bring an SRE’s toolkit of experience, with strong opinions on how to invest in on-call and incident management practices to ensure system reliability.
You have hands-on experience scaling early systems from 1 → 10 in a fast-paced, high-growth environment