The Senior Manager, SRE leads a team of Site Reliability Engineers responsible for the reliability, performance, and operational support of our supply chain systems.
Requirements
- Mastery of an object-oriented programming language (preferably Java).
- Proven experience in reliability reviews, performance engineering, and destructive testing.
- Strong understanding of production engineering and operational support practices.
- Mastery of a modern scripting language (preferably Python).
- Mastery of a modern web application framework such as Ruby on Rails, Spring MVC, or Node.js.
- Mastery of writing SQL queries against a relational database.
- Proficiency in troubleshooting and issue resolution techniques.
Responsibilities
- Conduct reliability reviews to identify areas for improvement and implement solutions to enhance system reliability.
- Implement and promote performance engineering practices to ensure optimal system performance.
- Develop and execute strategies for destructive testing to identify potential points of failure and improve system resilience.
- Oversee production engineering efforts to ensure systems are designed for operational excellence and reliability.
- Provide leadership in incident management and root cause analysis to resolve production issues and prevent recurrence.
- Establish and maintain operational support practices, including monitoring, alerting, and incident response.
- Write custom code or scripts to automate infrastructure, monitoring services, and test cases.
Other
- Lead and mentor a team of Site Reliability Engineers, fostering a culture of continuous improvement and innovation.
- Collaborate with cross-functional teams to ensure alignment on reliability and performance goals.
- Demonstrate the ability to lead managers and their teams and to drive change management and process improvement.
- Overnight travel is typically required 5% to 20% of the time.
- Excellent leadership and team management skills.