Suki is looking to improve the reliability, scalability, performance, and security of its cloud-native platforms, which are powered by a Kubernetes-based microservices architecture, to support its AI voice solutions for healthcare and reduce the administrative burden on clinicians.
Requirements
- A strong background in backend software engineering (Python, Go, Java, or similar).
- Experience with cloud platforms, especially GCP, and Kubernetes in a production environment.
- A genuine passion for Site Reliability Engineering principles and a desire to automate everything.
- Excitement about leveraging AI to solve complex operational challenges and a curiosity to explore new frontiers in SRE tooling.
- Proven ability to troubleshoot complex distributed systems and drive incidents to resolution.
Responsibilities
- Driving AI-Powered Operations: Lead initiatives to integrate AI into our SRE workflows for proactive anomaly detection, intelligent incident response, predictive capacity planning, and automated remediation.
- Software Engineering for SRE: Develop robust automation, tooling, and services in Python, Go, or similar languages to improve operational efficiency, system observability, and developer experience.
- Debug and analyze systemic problems, often stemming from customer-reported issues, identifying root causes and ensuring the right team drives the fix to improve the overall customer experience.
- Kubernetes & Microservices Mastery: Manage and optimize our Kubernetes-based microservices platform, ensuring seamless deployments, efficient resource utilization, and resilient operations.
- Incident & Problem Management: Take ownership of production incidents, drive root cause analysis (RCA), and implement preventative measures to continuously improve system uptime and performance. Participate in the on-call rotation.
- Security & Compliance Champion: Embed security best practices into our infrastructure, contributing to our HIPAA, SOC 2, and future HITRUST compliance initiatives.
- Collaboration & Mentorship: Partner closely with commercial and engineering teams to ensure their services are production-ready, share SRE knowledge, and foster a culture of shared ownership for reliability.
Other
- This is a hybrid position: you must be willing to work from our Redwood City office at least three times per week.
- Relocation assistance is available for candidates willing to move to the Bay Area!
- Excellent communication skills and a collaborative spirit.