The High Availability (HA) team at Microsoft is looking to improve the availability and redundancy of customer data in Substrate by exploring, owning, and improving next-generation storage solutions, and by leveraging AI tools to optimize the storage stack and enhance backend server performance.
Requirements
- 4+ years of software design and development experience with backend services.
- 4+ years of hands-on experience in any object-oriented coding language such as C++, C#, Java, or Python, or equivalent experience with C.
- Experience designing, implementing, and supporting services in a cloud environment.
- Experience deploying and maintaining large-scale distributed solutions.
- Expertise in establishing and validating performance metrics for backend systems, including the design and execution of comprehensive testing strategies to assess solution scalability, reliability, and efficiency.
- A structured and methodical approach to software design, and a passion for building reliable, well-tested code.
Responsibilities
- Driving feature initiatives that shape the architecture and operational reliability of Exchange Online’s High Availability component, ensuring robust availability and redundancy across M365 backend servers at industry scale.
- Leading the Remote Spare Manager component, including automating monitoring and recovery of database redundancy, and providing redundancy reports and presentations for the Substrate Core Leadership Team at monthly service reviews.
- Developing production, monitoring, and test code; generating comprehensive reports; and performing in-depth performance analysis across the storage engine, database replication, and networking layers.
- Researching the underlying storage and networking layers to identify opportunities for throughput optimization and cost efficiency.
- Experimenting with and delivering next-generation storage solutions.
- Leveraging AI tools to further optimize the storage stack and enhance backend server performance.
- Implementing Copilot agents to facilitate telemetry collection and alert mitigation.
Other
- 3 days/week in office
- The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
- Candidates must pass the Microsoft Cloud Background Check.
- Microsoft will accept applications for the role until October 6, 2025.