Staff Software Engineer - Site Reliability

Intuit

Salary not specified

Oct 24, 2025

San Diego, CA, US

Intuit's Identity Team is looking for a Site Reliability / DevOps Engineer to build and operate large-scale, secure, fault-tolerant, performant, highly available, affordable, and scalable cloud-native, microservices-based systems running on Kubernetes and AWS. The goal is to improve the efficiency and speed of delivering high-quality, secure software while ensuring reliability and scalability for critical identity services.

Requirements

10+ years of experience in developing and operating complex distributed software systems in an enterprise cloud native environment (AWS preferred).
Strong AWS development and deployment knowledge; GCP a plus.
Demonstrated experience operating high scale and high availability services in the cloud.
Demonstrated experience in designing highly resilient services and building recovery mechanisms.
Experience using AI to solve complex operational and auto healing problems.
Experience developing infrastructure as code (Terraform/CDK preferred) and CI/CD pipelines using Jenkins, CircleCI, Cloud Builder, Docker, Kubernetes, ECS.
Coding in Python, Java, Go, or other similar languages, combined with strong operational skills.

Responsibilities

Act as the technical subject matter expert to evaluate and evangelize forward-looking processes, tools, technologies, and architecture to help deliver high-quality, secure software faster and more efficiently while meeting availability, scale, and performance requirements in an AWS public cloud and Kubernetes environment.
Design and develop self-recovery mechanisms and tools for massive scale platforms to enable faster and automatic recovery.
Design and develop observability components for massive scale platforms, to detect issues quickly and isolate the problem as part of fast recovery.
Contribute to the cost and capacity management for platform components, uncovering cost saving opportunities and developing automation to enforce them.
Build self-service tools to enable platform consumers to troubleshoot and triage issues in a scalable manner.
Contribute to FMEA (Failure Mode and Effects Analysis) and Chaos Engineering for critical platform components, identifying resiliency gaps and preparing the team for faster recovery from production incidents.
Continuously evolve development practices and operational maturity through structured root cause analysis and monitoring.

Other

Actively evolve the system / infrastructure target state working with a cross-functional team from Architecture, Product Management, and Production Operations.
Contribute to the roadmap and strategy for the Operational Excellence, Resiliency, and Cost Optimization charters for Identity platform capabilities.
Troubleshoot complex issues and manage stakeholders' expectations during incidents.
Participate in 12/7 on-call rotations.
Support and coach other engineers through pair programming and peer code review, helping to ensure that all engineers are growing and part of a community. Be a role model to engineers and inspire a high technical bar for the team.