McAfee is looking for a Site Reliability Engineer (SRE) to maintain service levels (availability, latency, and reliability) for its customer-facing services, reduce friction in managing change, and ensure the security, performance, cost efficiency, and compliance of those services.
Requirements
- Experience with monitoring, logging, APM, and related tools: Grafana, CloudWatch, etc.
- Experience with CI/CD tools: Git, Jenkins, Harness, etc.
- Experience with container technologies: Kubernetes, Docker
- Experience with both Windows and Linux operating systems
- Strong knowledge of AWS cloud service offerings covering serverless and containerized workloads
- Able to monitor, debug, and perform root cause analysis (RCA) for service failures
- Experience maintaining and operating production systems (> 99.95% SLA) in the cloud
Responsibilities
- Proactively monitor mission-critical production environments and respond quickly to issues and to breaches of expected trends.
- Troubleshoot, debug, and escalate issues, with proper analysis, to the relevant teams to ensure maximum availability.
- Troubleshoot problems in real-time, interacting with DevOps/Engineering and internal support representatives to deliver maximum customer satisfaction.
- Detect and triage all operational incidents and requests.
- Work across Engineering and Support teams to ensure we meet our goals for service reliability, availability, and efficiency.
- Ensure security events and alerts are addressed in a timely manner.
- Own the availability and performance of mission-critical services. Build automation to prevent problem recurrence and to respond to all non-exceptional service conditions.
Other
- This is a hybrid position located in Frisco, TX. You will be required to be onsite on an as-needed basis, typically 1 to 6 times a month.
- We are only considering candidates within a commutable distance to one of the two locations and are not offering relocation assistance at this time.
- Exceptional communication skills that cross both team and geographical boundaries
- Advanced knowledge and skills within a specific technical or professional discipline, with an understanding of how the work affects other areas of the organization.
- Ability to work some non-standard hours to support a global team and initiatives.