This role ensures the scalability, reliability, and performance of systems while improving operational efficiency and preventing future incidents through proactive monitoring, troubleshooting, and automation.
Requirements
- Strong understanding of cloud services (AWS, GCP, Azure).
- Proficiency in scripting languages (Python, Bash, etc.).
- Experience with containerization and orchestration (Docker, Kubernetes).
- Familiarity with monitoring and logging tools (Prometheus, Grafana, ELK stack).
- Experience with CI/CD pipelines.
- Knowledge of infrastructure as code (Terraform, Ansible).
- Understanding of networking and security best practices.
Responsibilities
- Design, implement, and manage scalable and reliable systems.
- Monitor system performance and troubleshoot issues.
- Collaborate with development teams to improve the reliability and performance of applications.
- Implement automation tools and frameworks to improve operational efficiency.
- Conduct post-mortem analyses and develop strategies to prevent future incidents.
- Participate in on-call rotation and respond to incidents as they arise.
- Develop and maintain documentation for systems and processes.
Other
- BA/BS degree and 0-2 years’ relevant experience, or an equivalent combination of education and experience.
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- Experience in a Site Reliability Engineer or related role.
- Excellent problem-solving skills and ability to work under pressure.
- This role may require access to export-controlled commodities and technology. To conform to U.S. export control regulations, applicants should therefore be eligible for any required authorizations from the U.S. Government.