Microsoft is looking to optimize fleet health and reduce offline capacity across hyperscale environments for Azure Storage, driving intelligent solutions to improve reliability, minimize operational overhead, and enable scalable Artificial Intelligence (AI) and Machine Learning (ML) workloads for customers.
Requirements
- Bachelor's Degree in Computer Science or a related technical discipline AND 6+ years of technical engineering experience coding in languages including, but not limited to, C, C++, C#, or Python
- 2+ years of experience demonstrating a deep understanding of server hardware architecture and fleet-level hardware lifecycle management, including diagnostics, telemetry, and failure mitigation.
- 3+ years of technical background in cloud infrastructure and storage systems, preferably within hyperscale environments.
- Experience with AI/ML-driven automation for anomaly detection, predictive maintenance, or system optimization.
- Experience driving engineering excellence, including service quality, reliability, and operational readiness.
- Demonstrated comfort working across time zones in a remote-friendly, globally distributed team environment.
Responsibilities
- Lead and manage a high-performing engineering team focused on scaling and optimizing Azure Storage’s global fleet infrastructure.
- Drive planning and execution of team deliverables, ensuring alignment with partner teams, business goals, technical strategy, and service-level objectives.
- Develop and deliver scalable features that reduce offline capacity, improve fleet reliability, and minimize manual operational overhead and risk.
- Leverage AI/ML to build intelligent automation for anomaly detection, predictive maintenance, and fleet health optimization.
- Engage with senior leadership, including VP-level stakeholders, to influence roadmap priorities and communicate impact.
- Guide the team in driving project plans, release plans, and work items for multiple groups, in coordination with appropriate stakeholders (e.g., project managers).
- Guide the team and act as an expert for Designated Responsible Individual (DRI) duties, overseeing other engineers across product lines while working on call to monitor the system, product, or service for degradation, downtime, or interruptions.
Other
- 2+ years of people management experience.
- 3+ years of demonstrated ability to plan and execute complex projects, including setting priorities, managing timelines, and delivering results across cross-functional teams.
- The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
- Ability to engage Vice President (VP)-level leadership and influence technical roadmaps.