Microsoft Azure is the fastest-growing business in Microsoft's history and serves as the foundation of Microsoft's commercial cloud services. Our team, Azure Core, builds and manages the core platform that supports a wide range of services. As a Principal Software Engineer - Azure Core, you will have an exciting opportunity to innovate and shape the future of computing, and we encourage you to apply and learn more.
Requirements
- 8+ years of technical engineering experience coding in languages including, but not limited to, Go, Rust, Bash, or Python.
- 5+ year(s) experience building and managing data centers.
- Bootstrapping and managing data center (DC) infrastructure, including device inventory, diagnosis, and repairs.
- Networking and security expertise in high-performance computing, Remote Direct Memory Access (RDMA) over InfiniBand or RDMA over Converged Ethernet (RoCE), and eBPF.
- Driver and firmware lifecycle management, including GPU diagnostics.
- Storage and acceleration technologies for AI workloads, including distributed storage systems for multi-exabyte AI workloads and high-throughput data pipelines.
- 1+ year(s) experience with Artificial Intelligence (AI) and Machine Learning (ML) job scheduling and orchestration at scale, using technologies such as Simple Linux Utility for Resource Management (SLURM), Ray, and Kueue.
Responsibilities
- Provides technical leadership for the identification of dependencies and the development of design documents for a product, application, service, or platform.
- Leads by example and mentors others to produce extensible and maintainable code used across the company.
- Leverages deep subject-matter expertise in cross-product features with appropriate stakeholders (e.g., project managers) to lead multiple products' project plans, release plans, and work items.
- Holds accountability as a Designated Responsible Individual (DRI), mentoring engineers across products/solutions, working on-call to monitor system/product/service for degradation, downtime, or interruptions.
- Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that improve the availability, reliability, efficiency, observability, and performance of products; drives consistency in monitoring and operations at scale; and shares knowledge with other engineers.
Other
- Partners with appropriate stakeholders to determine user requirements for one or more complex scenarios.
- Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
- Model training optimization for performance and scalability.
- 1+ year(s) experience improving model serving and inference efficiency, ensuring low latency and high throughput for production workloads.