Intuit is looking to deliver an always-on SaaS platform to its global payroll customers and needs a software engineer to lead operational excellence and site reliability.
Requirements
- 5+ years of related experience, with expertise in one area of site reliability (automation, monitoring tools, cloud operations).
- Hands-on experience with at least one modern scripting language.
- Deep understanding of AWS services, Kubernetes, and monitoring tools.
- Proficiency in one or more of the following: Go, Java or Python.
- Understanding of the secure software development life cycle (SSDLC) and CI/CD pipelines.
- Passionate individual with the ability to diagnose and resolve both pre-production and production issues.
- Passion for working on systems that are highly reliable, maintainable, scalable, and secure.
Responsibilities
- Drive operational excellence for the connected services the business offers its customers, delivering an 'always on' operation year-round at the right cost.
- Adopt observability best practices with distributed tracing to reduce mean time to detect (MTTD) and mean time to resolve (MTTR).
- Create or enhance monitoring capabilities, leveraging AI-assisted tools to increase alert accuracy, detect issues, and resolve them automatically.
- Explore the products offered to customers to gain a deep understanding of them, and influence the engineering culture toward developing observable applications.
- Create runbooks covering the standard operating procedures for every production change.
- Develop failure mode and effects analysis (FMEA) and chaos engineering best practices backed by automation.
- Invest in self-service capabilities to drive efficiency, with a focus on reducing friction and manual steps.
Other
- Ability to deliver work incrementally to get feedback and iterate on solutions.
- Willingness to take initiative and unblock yourself to get things done.
- You are easy to work with: you communicate well, take feedback in a positive way, and are OK with not always doing the most glamorous tasks.
- Participate in the on-call rotation: respond to incoming alerts, triage, and take the necessary steps to minimize impact.
- Contribute to infrastructure updates such as compute, storage, network and content changes.