Cleerly is revolutionizing how heart disease is diagnosed, treated, and tracked through value-based, AI-driven precision diagnostic solutions, with the goal of helping prevent heart attacks. The company is seeking a Site Reliability Engineer to ensure the health and integrity of its enterprise-level imaging platform, focusing on repeatable deployments and the stability of new product streams within AWS.
Requirements
- 6–10+ years of professional experience running and managing production services on AWS.
- Deep understanding of AWS fundamentals, including VPC networking, IAM, KMS, security groups, and routing.
- Expertise with Infrastructure as Code (Terraform, CDK, or CloudFormation) and with replicating environments reliably.
- Experience operating and managing container platforms (EKS/ECS) and/or scalable managed services.
- Proven ability to design and automate comprehensive CI/CD pipelines (builds, tests, deploys, and rollbacks).
- Deep knowledge of metrics, logs, and traces, plus experience setting SLOs, configuring robust alerting, and running structured incident response processes.
- Practical experience with high availability (HA) and disaster recovery (DR), including backup strategies, multi-AZ patterns, and failure drills.
Responsibilities
- Stand up and harden the new Hub cloud environment and deployment pipeline, ensuring reliability, security, and repeatability.
- Design, develop, and manage cloud infrastructure using AWS services, Terraform (Infrastructure as Code), and Docker containers.
- Use strong system administration and network engineering skills to ensure the reliability, scalability, and performance of all platform systems.
- Own observability and incident readiness end-to-end, including third-party connectivity patterns, runtime guardrails, and upgrade strategies (canary releases and rollbacks).
- Implement DevOps practices and tooling that enable continuous integration (CI), continuous delivery (CD), and automated infrastructure management.
- Develop and maintain automation tools to proactively reduce manual operational tasks (toil).
- Maintain system and network security by implementing and enforcing appropriate security controls across the platform.
Other
- A team player with an appetite for hands-on work.
- A highly motivated, detail-oriented self-starter.
- Demonstrates strong ownership, accountability, and commitment to high-quality deliverables.
- Bachelor’s degree in Computer Science, Information Technology, or a related field, or equivalent experience.
- Proven experience in Site Reliability Engineering, DevOps, or a similar role.