Microsoft Azure is seeking a Service Engineer II to work on live site issues, problem management, and customer reliability. This role is accountable for enhancing the customer experience across Azure, including first-party services. The ideal candidate will demonstrate strong breadth in managing complex, highly available services, paired with deep technical expertise in Azure Core Services and their interdependencies.
Requirements
- Proven experience in cloud operations, incident and crisis management, or large-scale systems engineering, ideally within platforms such as Azure, AWS, or GCP.
- Demonstrated experience in 24×7×365 enterprise environments, managing mission-critical services.
- Demonstrated experience implementing AI-driven solutions and automation, with proficiency in one or more programming/automation languages (e.g., C, C++, C#, Java, JavaScript, Python) or equivalent expertise.
- ITIL, SRE, or other industry-recognized technical and operational certification.
- 1+ year(s) of technical experience working with large-scale cloud or distributed systems.
- 3+ years of demonstrated experience in incident management or crisis management for critical, high-severity incidents in high-availability, distributed environments.
- Deep understanding of cloud architecture patterns, High Availability, Disaster Recovery, Business Continuity, and Performance Tuning for platform services.
Responsibilities
- Lead and manage high-severity incidents across Azure services, serving as the single point of accountability to ensure rapid detection, triage, resolution, and customer communication.
- Act as the central authority during live site incidents, driving real-time decision-making and coordination across Engineering, Support, PM, Communications, and Field teams.
- Contribute to the design of vNext architecture for cloud infrastructure services, based on customer and first-party engagements.
- Engage in major production triage efforts, working with different teams as required to identify the root cause of highly impactful or complex issues. Identify product gaps and work with Product teams to close them.
- Partner closely with software developers, Product Managers, architects, and Infrastructure teams to deliver sustainable, reusable design patterns, so that non-functional production support requirements are adopted early in migration and deployment.
- Analyze customer-impacting signals from telemetry, support cases, and feedback to identify root causes, drive incident reviews (RCAs/PIRs), and implement preventative service improvements.
- Drive continuous improvement of the Azure platform by incorporating learnings from live site events and customer feedback, ensuring improved reliability, observability, and supportability.
Other
- Are you passionate about cloud computing, obsessed with customer experience, and driven to resolve complex issues under pressure?
- Do you thrive in high-stakes, live environments and want to play a pivotal role in ensuring the reliability of Microsoft’s cloud platform?
- Success in this role requires the ability to influence and collaborate across many Azure servicing teams to ensure customer needs are met.
- Strong communication skills—both written and verbal—are essential.
- Exhibit strong cross-team collaboration, an engineering mindset, and results-oriented execution under pressure.