Erie Insurance is seeking to enhance the reliability, availability, and performance of cloud-based AI/ML environments to support enterprise-scale AI initiatives within the IT organization.
Requirements
- Advanced knowledge of one or more cloud platforms (e.g., AWS, Azure, GCP), with deep expertise in managing cloud infrastructure.
- Proficiency in Infrastructure as Code (IaC) tools (e.g., Terraform, CloudFormation, CDK, ARM, Pulumi), with a focus on the configuration and management of cloud resources.
- Expertise in CI/CD pipelines and automation tools (e.g., Jenkins, GitLab, Ansible), and scripting languages (e.g., Python, Bash, PowerShell) to streamline operational tasks.
- Proficiency with monitoring and observability tools (e.g., CloudWatch, Prometheus), with experience in cloud incident response and troubleshooting.
- Ability to execute and improve operational procedures, including system maintenance, patching, recovery, and monitoring.
- Understanding of operational standards, controls, and compliance requirements, ensuring that cloud environments meet necessary regulations.
- Ability to identify operational inefficiencies and propose strategic solutions to optimize system performance and availability.
Responsibilities
- Lead incident response and provide on-call support for escalated cloud incidents and deployments, conducting root cause analysis, ensuring timely resolution, and implementing improvements to prevent future incidents.
- Develop and maintain automated operational procedures, such as system maintenance, monitoring, compliance, patching, and recovery, to enhance cloud service reliability and reduce manual intervention.
- Collaborate with cross-functional teams and leadership, providing operational insights to improve cloud infrastructure reliability, availability, and performance.
- Monitor and optimize cloud service performance, identifying issues proactively and leading continuous improvement efforts to enhance system reliability and operational efficiency.
- Mentor and guide junior engineers, providing technical leadership and promoting best practices in cloud operations, automation, and incident response.
- Lead operational improvement initiatives, including chaos engineering practices, operational drills, and testing activities to improve response, resiliency, and detection capabilities.
- Ensure cloud environments adhere to operational standards, security, and compliance requirements, managing operational readiness, disaster recovery, and performance processes and procedures.
Other
- Bachelor’s degree in computer science or engineering, or equivalent industry experience in a related technical field, and five years of professional experience in a related technical field; or Associate’s degree and seven years of experience; or high school diploma and nine years of experience, required.
- Associate-level cloud certification (such as AWS Certified Solutions Architect - Associate) preferred, or willingness to obtain one within 6 months of hire.
- Ability to move over 50 lbs using lifting aid equipment; Rarely.
- Climbing/accessing heights; Rarely.
- Driving; Occasional (<20%).