Unlock unprecedented productivity by embedding data and intelligence at the core of every business process
Requirements
- Minimum of 5 years of experience building and maintaining cloud-based software applications on at least one public cloud platform (AWS, Azure, or GCP)
- Proficiency in Java, the Spring framework, and Python (or a similar scripting language) in a Linux environment
- Prior experience contributing to Site Reliability Engineering initiatives or similar operational roles
- Knowledge of SRE principles, including SLI/SLO design, error budgets, and toil reduction strategies
- Proven expertise in developing and operating production-grade, scalable services using Kubernetes and elastic cloud architectures
- Strong problem-solving and troubleshooting abilities in complex, distributed systems
Responsibilities
- Lead reliability efforts for a fleet of 80+ FedRAMP-compliant microservices running on Kubernetes, applying SRE principles to drive observability, automation, and incident prevention
- Own high-priority application incident escalations, performing deep technical analysis and restoring service within defined SLOs, while continuously improving detection and response mechanisms
- Engineer solutions to enhance the availability, latency, and performance of production services, automating manual processes to eliminate toil and scale operational efficiency
- Collaborate closely with platform and application engineering teams to conduct post-incident reviews, extract insights, and implement systemic changes that improve overall reliability
- Document operational knowledge and runbooks, embedding SRE best practices into onboarding, incident response, and platform architecture standards
Other
- Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related technical field (or equivalent hands-on experience)
- Excellent written and verbal communication skills in English
- Please note: This position is not eligible for immigration visa sponsorship, now or in the future