Staff Software Engineer - Site Reliability

Intuit

Salary not specified
Oct 24, 2025
San Diego, CA, US

Intuit's Identity Team is looking for a Site Reliability / DevOps Engineer to build and operate large-scale, secure, fault-tolerant, performant, highly available, affordable, and scalable cloud-native, microservices-based systems running on Kubernetes and AWS. The goal is to deliver high-quality, secure software faster and more efficiently while ensuring reliability and scalability for critical identity services.

Requirements

  • 10+ years of experience developing and operating complex distributed software systems in an enterprise cloud-native environment (AWS preferred).
  • Strong AWS development and deployment knowledge; GCP is a plus.
  • Demonstrated experience operating high-scale, high-availability services in the cloud.
  • Demonstrated experience in designing highly resilient services and building recovery mechanisms.
  • Experience using AI to solve complex operational and auto-healing problems.
  • Experience developing infrastructure as code (Terraform/CDK preferred) and CI/CD pipelines using Jenkins, CircleCI, Cloud Builder, Docker, Kubernetes, and ECS.
  • Proficiency coding in Python, Java, Go, or similar languages, combined with strong operational skills.

Responsibilities

  • Act as the technical subject matter expert who evaluates and evangelizes forward-looking processes, tools, technologies, and architecture to help deliver high-quality, secure software faster and more efficiently while meeting availability, scale, and performance requirements in an AWS public cloud and Kubernetes environment.
  • Design and develop self-recovery mechanisms and tools for massive scale platforms to enable faster and automatic recovery.
  • Design and develop observability components for massive scale platforms to detect issues quickly and isolate problems as part of fast recovery.
  • Contribute to the cost and capacity management for platform components, uncovering cost saving opportunities and developing automation to enforce them.
  • Build self-service tools to enable platform consumers to troubleshoot and triage issues in a scalable manner.
  • Contribute to FMEA (Failure Mode and Effects Analysis) and Chaos Engineering for critical platform components, identifying resiliency gaps and preparing the team for faster recovery from production incidents.
  • Continuously evolve development practices and operational maturity through structured root cause analysis and monitoring.

Other

  • Actively evolve the system/infrastructure target state, working with a cross-functional team from Architecture, Product Management, and Production Operations.
  • Be a part of the roadmap and strategy for the Operational Excellence, Resiliency and Cost Optimization charters for Identity platform capabilities.
  • Troubleshoot complex issues and manage stakeholders' expectations during incidents.
  • Participate in 12/7 on-call rotations.
  • Support and coach other engineers through pair programming or peer code review, helping to ensure that all engineers are growing and part of a community. Be a role model to engineers and inspire a high technical bar for the team.