Software Engineer II

Microsoft

Salary not specified
Sep 30, 2025
Remote, US

The Azure Compute team builds a fault-tolerant, distributed system on top of commodity datacenter hardware to deliver infrastructure for hosting cloud applications in virtual machines (VMs). The team creates the illusion that resources are limitless, infinitely elastic, and always available. The Availability Platform team within Azure Compute focuses on ensuring every Azure VM achieves an SLA of 99.99+%. Achieving and exceeding this target requires out-of-the-box thinking, backed by sound data-driven decisions and intelligent automation. The team owns services that monitor the health of millions of Azure machines and the control plane services that make all repair decisions in Azure. We leverage AI and machine learning to build predictive failure models that proactively live-migrate VMs before failures occur, minimizing customer impact and improving platform resilience. We are also exploring the use of generative AI to enhance diagnostics, automate root cause analysis, and accelerate incident resolution.

Requirements

  • Experience coding in languages including, but not limited to, Rust, C, C++, C#, Java, JavaScript, or Python.
  • 1+ years of experience in designing, building, and shipping high-quality production software or services in a cloud environment.
  • 1+ years of experience in large-scale distributed systems analysis and troubleshooting.
  • 1+ years of experience in high-stakes production online service environments.
  • Demonstrated ability and passion for designing and building highly available distributed systems at scale.

Responsibilities

  • Leads the design and architecture of change management features and services in Azure Compute.
  • Develops high-quality, extensible, maintainable code and coaches others to do the same.
  • Supports live site as Designated Responsible Individual (DRI), mentoring engineers across products and solutions and working on call to monitor the system, product, or service for degradation, downtime, or interruptions.
  • Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that improve the availability, reliability, efficiency, observability, and performance of products. Drives consistency in monitoring and operations at scale, and shares knowledge with other engineers.
  • Collaborates with data scientists and ML engineers to design and integrate predictive models that proactively detect hardware anomalies and trigger live migrations, improving VM uptime and SLA compliance.
  • Leads initiatives to embed AI-driven diagnostics and root cause analysis into availability services, reducing time-to-resolution for incidents and improving operational efficiency.
  • Drives the adoption of generative AI tools to automate documentation, incident summaries, and engineering workflows, enhancing team productivity and knowledge sharing.

Other

  • Partners with appropriate stakeholders across teams and orgs to determine project requirements.
  • Leverages expertise with appropriate stakeholders to develop project plans, release plans, and work items.
  • Demonstrated problem solving and debugging skills.
  • Demonstrated ability to exercise sound judgment in ambiguous situations.
  • Experience with agile methodologies is a plus; a willingness to adopt them is required.