The Microsoft Azure Compute team is working to ensure that every Azure virtual machine achieves a service level agreement (SLA) of 99.99 percent or higher. This requires innovative thinking supported by data-driven decisions and intelligent automation, including the use of AI and machine learning for predictive failure models and proactive migration of virtual machines to reduce customer impact and improve platform resilience. The team is also exploring generative AI to enhance diagnostics, automate root cause analysis, and accelerate incident resolution.
Requirements
- Experience coding in languages including, but not limited to, Rust, C, C++, C#, Java, JavaScript, or Python.
- 2+ years of technical engineering experience.
- 1+ years of experience in designing, building, and shipping high-quality production software or services in a cloud environment.
- 1+ years of experience in large-scale distributed systems analysis and troubleshooting.
- 1+ years of experience in high-stakes production online service environments.
- Demonstrated ability and passion for designing and building highly available distributed systems at scale.
- Demonstrated problem-solving and debugging skills.
Responsibilities
- Partners with stakeholders across teams and organizations to determine project requirements and leads the design and architecture of change management features and services in Microsoft Azure Compute.
- Leverages expertise with stakeholders to develop project plans, release plans, and work items, while identifying dependencies and authoring design documents for features and services.
- Develops high-quality, extensible, and maintainable code and coaches others to follow best practices for software development.
- Supports live site operations as the Designated Responsible Individual (DRI), mentors engineers across products and solutions, and participates in on-call rotations to monitor systems for degradation, downtime, or interruptions.
- Proactively seeks new knowledge and adapts to emerging trends, technical solutions, and patterns to improve availability, reliability, efficiency, observability, and performance of products, while driving consistency in monitoring and operations at scale and sharing knowledge with other engineers.
- Collaborates with data scientists and machine learning engineers to design and integrate predictive models that detect hardware anomalies and trigger live migrations, leads initiatives to embed artificial intelligence–driven diagnostics and root cause analysis into availability services, and drives adoption of generative artificial intelligence tools to automate documentation, incident summaries, and engineering workflows.
- Partners with platform teams to build intelligent observability pipelines leveraging anomaly detection and trend analysis for early warning systems, and evaluates and integrates large-scale artificial intelligence models into control plane services to enable smarter, context-aware repair decisions across millions of Azure virtual machines.
Other
- Demonstrated ability to exercise sound judgment in ambiguous situations.
- Experience with agile methodologies is a plus; a willingness to adopt them is required.