Skyhigh Security is looking to solve the problem of maintaining a highly available production environment and improving operational aspects of its systems, such as monitoring, alerting, incident response, and vendor interactions.
Requirements
- System administration experience in Linux environments.
- Experience with end-to-end monitoring setup for infrastructure and applications.
- Experience with Prometheus, Grafana, ELK, OpenSearch, CloudWatch, PagerDuty, and other monitoring tools.
- Solid experience with cloud platforms such as AWS and OCI.
- Good experience with containerized workloads and orchestration tools such as Kubernetes.
- Network knowledge (TCP/IP, UDP, DNS, load balancing) and prior network administration experience are required.
- Experience with BGP, iBGP, NAT, TCP/IP, proxies, and cross-connects.
Responsibilities
- Perform Incident Management and Change Management to maintain the continuous availability of all Cloud Infrastructure services.
- Ensure all SRE and operating procedures are maintained and executed.
- Maintain a 24x7 production environment with a high level of service availability, perform quality reviews, and manage operational issues.
- Perform root cause analysis for major incidents and drive the process by involving required stakeholders.
- Perform problem management by analyzing metrics, alarms, and dashboards to troubleshoot problem areas, and report issues to assist in performance tuning and fault finding.
- Implement proactive monitoring, alerting, trend analysis, and self-healing solutions.
- Explore new technologies, features, and tools to improve the platform, and automate operational tasks using Bash, Python, or any other programming language.
Other
- Bachelor’s degree in computer science, electrical engineering, or a related area, with 7+ years of SRE experience in a large enterprise organization.
- Ability to work a flexible schedule in a 24x7 environment with rotational shifts.
- Strong communication and analytical/problem-solving skills.
- Systematic approach and the ability to drive problems to resolution.
- Paid Time Off