Service Engineer

Microsoft

$84,200 - $180,400
Aug 19, 2025
Redmond, WA, USA

Microsoft Azure is seeking Service Engineers to ensure the reliability and customer experience of its cloud platform, addressing complex live-site issues and driving improvements in incident management and service resilience.

Requirements

  • Proven experience in cloud operations, incident and crisis management, or large-scale systems engineering, ideally within platforms such as Azure, AWS, or GCP.
  • Demonstrated success managing mission-critical services in 24×7×365 enterprise environments.
  • Experience implementing AI-driven solutions and automation, with proficiency in one or more programming or scripting languages (e.g., C, C++, C#, Java, JavaScript, Python) or equivalent expertise.
  • Solid understanding of Windows or Linux platforms and developer tools, with the ability to diagnose cloud platform issues and apply AI-driven approaches to enhance reliability.
  • Knowledge of cloud architecture patterns, including high availability, disaster recovery, business continuity, and performance optimization.
  • Familiarity with monitoring and observability tools such as Azure Monitor, Watch Dog, Grafana, Prometheus, Datadog, Splunk, or New Relic.
  • Exposure to chaos engineering, fault injection, or resilient architecture design.

Responsibilities

  • Lead and manage high-severity incidents across Azure services, serving as the single point of accountability to ensure rapid detection, triage, resolution, and customer communication.
  • Act as the central authority during live-site incidents, driving real-time decision-making and coordination across Engineering, Support, PM, Communications, and Field teams.
  • Engage in major production triage efforts, working with other teams to identify the root causes of high-impact or complex issues; identify product gaps and work with Product teams to close them.
  • Partner closely with software developers, product managers, architects, and infrastructure teams to deliver sustainable, reusable design patterns that ensure non-functional production support requirements are adopted early in migration and deployment.
  • Analyze customer-impacting signals from telemetry, support cases, and feedback to identify root causes, drive incident reviews (RCAs/PIRs), and implement preventative service improvements.
  • Drive continuous improvement of the Azure platform by incorporating learnings from live-site events and customer feedback, ensuring improved reliability, observability, and supportability.
  • Collaborate closely with Engineering and Product teams to influence and implement service resiliency enhancements, auto-remediation tools, and customer-centric mitigation strategies.

Other

  • The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
  • Experience leading incident or crisis response for high-severity events in highly available, distributed systems.
  • Ability to provide clear direction and foster collaboration among internal stakeholders and external partners during complex situations.
  • Ability to make strategic decisions under pressure, demonstrating leadership, analytical thinking, and teamwork across diverse groups.