RH is looking for a Principal Site Reliability Engineer (SRE) to provide strategic support and execute infrastructure, security, continuous integration, deployment, and IT operations practices, along with scaling and metrics, and to run day-to-day operations of production and development infrastructure for a cloud-based commerce/enterprise platform.
Requirements
- Obsess about site reliability and performance, and continuously look for ways to improve both.
- Own and lead initiatives to define, design, and implement solutions that help prevent issues impacting availability/performance and reduce time to resolution.
- Understand the overall e-commerce architecture and identify opportunities to optimize it with an eye on availability/performance.
- Identify and execute on automation opportunities in the context of code deployment, problem identification, and resolution.
- Act as a subject matter expert on SRE/DevOps best practices with CloudFormation, Auto Scaling Groups, build tools, monitoring, and configuration management.
- Analyze best practices and emerging concepts in DevOps, infrastructure automation, Akamai configuration management, and enterprise security.
- Continuously improve observability capabilities (e.g., Prometheus, Grafana, Splunk) to ensure the right leading indicators are monitored and appropriate response workflows are set up.
Responsibilities
- Provide strategic support and execute infrastructure, security, continuous integration, deployment, and IT operations practices, along with scaling and metrics, and run day-to-day operations of production and development infrastructure for a cloud-based commerce/enterprise platform.
- Work closely with the Development and QA teams to continuously improve existing features and roll out new services, ensuring the high availability of our platform.
- Fix code, write tests, debug, and ship features.
- Define, design, and implement solutions that help prevent issues impacting availability/performance and reduce time to resolution.
- Identify and execute on automation opportunities in the context of code deployment, problem identification, and resolution.
- Continuously improve observability capabilities (e.g., Prometheus, Grafana, Splunk) to ensure the right leading indicators are monitored and appropriate response workflows are set up.
- Create technical documentation and maintain the CI/CD pipeline (Jenkins).
Other
- If you possess a "can do" attitude, are driven by research and problem-solving, and thrive on challenges, this opportunity will interest you.
- You’re comfortable with infrastructure and configuration, but also happy to roll up your sleeves, fix code, write tests, debug, and ship features.
- BS/MS (MS preferred) in Computer Science or equivalent work experience.
- 4+ years of experience supporting mission-critical workloads such as e-commerce in a distributed architecture environment.
- Solid technical know-how and a proven record of problem-solving in a distributed architecture setting.
- Excellent critical-thinking skills and a demonstrated strong work ethic.
- Solid team player with the ability to collaborate cross-functionally with tech and business teams.
- Excellent communication skills, including a demonstrated ability to explain complex technical issues to technical and non-technical audiences, and a collaborative, partnership-oriented mindset.