Intuit's Identity Team is looking for a Site Reliability / DevOps Engineer to build and operate large-scale, secure, fault-tolerant, performant, highly available, affordable, and scalable cloud-native, microservices-based systems running on Kubernetes and AWS. The goal is to improve the efficiency and speed of delivering high-quality, secure software while ensuring reliability and scalability for critical identity services.
Requirements
- 10+ years of experience developing and operating complex distributed software systems in an enterprise cloud-native environment (AWS preferred).
- Strong AWS development and deployment knowledge; GCP is a plus.
- Demonstrated experience operating high-scale, high-availability services in the cloud.
- Demonstrated experience in designing highly resilient services and building recovery mechanisms.
- Experience using AI to solve complex operational and auto-healing problems.
- Experience developing infrastructure as code (Terraform/CDK preferred) and CI/CD pipelines using Jenkins, CircleCI, Cloud Builder, Docker, Kubernetes, and ECS.
- Coding experience in Python, Java, Go, or similar languages, combined with strong operational skills.
Responsibilities
- Act as the technical subject matter expert to evaluate and evangelize forward-looking processes, tools, technologies, and architecture that help deliver high-quality, secure software faster and more efficiently while meeting availability, scale, and performance requirements in an AWS public cloud and Kubernetes environment.
- Design and develop self-recovery mechanisms and tools for massive scale platforms to enable faster and automatic recovery.
- Design and develop observability components for massive scale platforms to detect issues quickly and isolate problems as part of fast recovery.
- Contribute to cost and capacity management for platform components, uncovering cost-saving opportunities and developing automation to enforce them.
- Build self-service tools to enable platform consumers to troubleshoot and triage issues in a scalable manner.
- Contribute to FMEA (Failure Mode and Effects Analysis) and Chaos Engineering for critical platform components, identifying resiliency gaps and preparing the team for faster recovery from production incidents.
- Continuously evolve development practices and operational maturity through structured root cause analysis and monitoring.
Other
- Actively evolve the system / infrastructure target state, working with a cross-functional team from Architecture, Product Management, and Production Operations.
- Contribute to the roadmap and strategy for the Operational Excellence, Resiliency, and Cost Optimization charters for Identity platform capabilities.
- Troubleshoot complex issues and manage stakeholders' expectations during incidents.
- Participate in 12/7 on-call rotations.
- Support and coach other engineers through pair programming and peer code review, helping to ensure that all engineers are growing and part of a community. Be a role model to engineers and inspire a high technical bar for the team.