Microsoft's Azure Data engineering team is looking to enhance the service reliability of their Azure SQL Database and Managed Instance services by building a reliable service using the right data, telemetry, monitoring, observability, and automation tools. The goal is to make data-driven decisions to efficiently monitor and improve service reliability.
Requirements
- 2+ years of experience designing and developing reliable engineering systems and/or infrastructure.
- Experience in writing design specifications and technical documentation for communication and knowledge sharing.
- Experience in leading and managing software projects with DevOps tools.
- Experience in using debugging tools such as WinDbg, Visual Studio, Xperf, and KQL to debug user dumps or live applications.
- Demonstrated troubleshooting skills in SQL Server/Azure SQL Database, with deep understanding of one or more of the following areas: Provisioning (Control Plane), Query Processing, Connectivity, High Availability, SQL Operating System (SQL OS), Storage Engine, Backup/Restore, and Replication.
- Solid understanding of Windows operating system concepts such as processes, threading, memory allocation, and the network stack; understanding of how applications are affected by them, and ability to debug the resulting issues.
Responsibilities
- Identify opportunities and implement automation solutions for efficient management of live-site incidents by leveraging a data-driven decision-making process.
- Serve as a subject matter expert in monitoring and troubleshooting Azure SQL Database and Managed Instance services.
- Own, triage, investigate, and resolve service issues with emphasis on broad communications, learning, and teaching throughout the process.
- Drive initiatives and projects using strong project management and communication skills.
- Embody our culture and values.
Other
- 3 days / week in-office
- Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
- This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
- Guide and drive component teams in building a reliable service using the right data, telemetry, monitoring, observability, and automation tools.
- Make data-driven decisions to efficiently monitor and improve service reliability with good project management and communication skills.