The partner company is looking to combine software engineering expertise with site reliability principles to build highly resilient, scalable, and secure systems. These systems power critical AI-driven applications and must deliver strong performance, availability, and reliability.
Requirements
- Strong backend software engineering experience in Python, Go, Java, or similar languages.
- Hands-on experience with cloud platforms, particularly GCP, and production Kubernetes environments.
- Demonstrated passion for Site Reliability Engineering principles and an automation-first mindset.
- Proven ability to troubleshoot complex distributed systems and drive incident resolution effectively.
- Knowledge of AI applications in operational workflows is a plus.
- Familiarity with security, compliance, and monitoring standards in enterprise environments.
- Preferred: experience in healthcare technology, microservices architecture, and AI-driven operational tooling.
Responsibilities
- Lead the design, implementation, and operation of scalable, cloud-native infrastructure and microservices platforms.
- Develop automation, tooling, and services to enhance operational efficiency, system observability, and developer experience.
- Drive AI-powered SRE initiatives for anomaly detection, predictive capacity planning, incident response, and automated remediation.
- Take ownership of production incidents, perform root cause analysis, and implement preventative measures to improve uptime and performance.
- Manage Kubernetes-based deployments, ensuring efficient resource utilization, seamless scaling, and robust system resilience.
- Embed security and compliance best practices into infrastructure, contributing to compliance with HIPAA, SOC 2, and other regulatory requirements.
- Collaborate with cross-functional teams, mentor junior engineers, and promote a culture of reliability, automation, and shared ownership.
Other
- Excellent communication, collaboration, and mentorship skills.
- Fully remote work flexibility with potential for hybrid arrangements.
- Opportunities to lead high-impact infrastructure and reliability projects.
- Professional growth, mentorship, and knowledge-sharing within a collaborative team.