Ensuring the availability, reliability, and performance of customer-facing software applications by creating highly scalable and fault-tolerant systems.
Requirements
- Proficiency with enterprise system monitoring software (examples: New Relic, Nagios, SolarWinds, Dynatrace, Datadog, Azure Monitor, Splunk)
- Experience with cloud-based infrastructure, databases, and applications
- Experience with performance tuning and fault finding in large-scale distributed systems
- Experience designing, implementing, and managing performance testing practices, including the supporting tools and frameworks
- Knowledge of disaster recovery planning and execution
- Strong understanding of coding, automation, and engineering principles to build resilient, self-healing systems
- Familiarity with DevOps practices and tools
Responsibilities
- Ensure the high availability and reliability of the production environment by monitoring system health and performance
- Provide primary operational support for large-scale distributed software applications
- Facilitate incident resolution via triage, communication, engagement, escalation, and documentation
- Partner with platform administration (both internal and external) to define and achieve stability and scalability objectives
- Collaborate with technical and quality teams to improve services by identifying areas of risk and helping to define and proactively implement solutions
- Drive continual improvement in system performance by setting service level objectives in collaboration with a performance center of practice and/or product development teams
- Analyze and publish metrics from operating systems and applications to assist in performance tuning and fault finding
Other
- Develop customer-facing and internal documentation, including best-practice guides, troubleshooting flowcharts, training materials, and FAQs, to ensure a consistent customer experience
- Take ownership of cases escalated by Associate Engineers and Engineers and drive them to resolution
- Ability to effectively work in a highly matrixed organization
- Possess a customer-centric mindset
- Excellent oral and written communication