In Azure Specialized, we work collaboratively to bring the next generation of workloads to the public cloud platform, enable new end-to-end scenarios for Azure customers, and design and deliver the next generation of AI training, AI inferencing, virtual desktop, video, and gaming infrastructure for Azure.
Requirements
- Experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python.
- 2+ years of experience in HPC or Machine Learning.
- Familiarity with Deep Learning, AI Infrastructure, Accelerators.
- 1+ years of experience with distributed systems, high-performance computing / machine learning middleware, hardware-software co-design, and profiling and performance analysis tools.
- Experience with hardware/software interaction, coding and experimenting with next-gen hardware, and end-to-end systems engineering anywhere in the infrastructure, from fiber networking and switches to GPU differentiation, rack design, cluster design, and more.
- Experience with Azure infrastructure and Azure platform
- Experience working in a test-driven engineering culture to reduce regressions and bugs in production.
Responsibilities
- Willing to dive deeply into any level or layer of a problem.
- Willing to learn emerging technologies, from hardware to software, and to evaluate them and make recommendations that advance Azure infrastructure for AI and other GPU-based workloads.
- Leads by example within the team by producing extensible and maintainable code. Optimizes, debugs, refactors, and reuses code to improve performance, maintainability, effectiveness, and return on investment (ROI).
- Drives identification of dependencies and the development of design documents for a product, application, service, or platform.
- Creates, implements, optimizes, debugs, refactors, and reuses code to establish and improve performance, maintainability, effectiveness, and return on investment (ROI).
- Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook, working on call to monitor the system/product/service for degradation, downtime, or interruptions, alerting stakeholders about status, and initiating actions to restore the system/product/service for simple and complex problems when appropriate.
- Helps ensure the Azure platform delivers consistent performance, scales on demand, and is engineered to withstand the unparalleled computing demand of customer workloads.
Other
- Bachelor's Degree in Computer Science or related technical field
- Ability to meet Microsoft, customer and/or government security screening requirements
- Travel 0-25%
- 0 days/week in-office - remote
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.