Arista Networks is looking for Site Reliability Engineers to ensure the reliability and scalability of its internal systems and infrastructure, which support a large engineering team and the development of industry-leading routing and switching products. The goal is to improve the development experience by maintaining and enhancing the systems used for building, testing, and deploying software.
Requirements
- Knowledge of one or more of Go, Python, JavaScript, or shell scripting.
- Knowledge of Linux (or UNIX).
- Experience operating and managing software systems at scale.
- Strong understanding of the fundamentals of storage and networking.
- Comfortable with Ansible and GitOps.
- Applied understanding of software engineering principles.
- Strong problem-solving and software troubleshooting skills.
Responsibilities
- Proactively monitor, respond to, and enhance alerts.
- Build automated responses to the most common alerts, or work with the rest of the EngProd team to build them.
- Create and maintain incident response runbooks in collaboration with the service development teams.
- Debug and resolve issues impacting developer user experience and infrastructure stability.
- Develop patterns to support system reliability and socialize them within the EngProd team.
- Review and contribute to the specifications and implementations written by other team members.
- Work with Arista’s software engineers to identify bottlenecks and limitations in our workflows, tooling, and infrastructure and provide fixes for those problems.
Other
- At least a BS in Computer Science or Engineering plus 5 years of experience, an MS in Computer Science or Engineering plus 3 years of experience, or equivalent work experience.
- Ability to design a solution and implement features independently.
- Ability to work in small teams.