Microsoft Azure Specialized is looking to design and deliver the next generations of AI training, AI inferencing, virtual desktop, video and gaming infrastructure for Azure, facing challenges across a wide spectrum of hardware architectures, network types and processor types. The team aims to provide end-to-end vertical solutions with a continuous focus on customer value, quality, performance, and automation, while also expanding capacity and supported scenarios for 100X growth.
Requirements
- Experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python.
- 2+ years of experience in HPC or Machine Learning.
- Familiarity with Deep Learning, AI Infrastructure, Accelerators.
- 1+ years of experience with Distributed Systems, High Performance Computing / Machine Learning middleware, Co-Designing Hardware-Software, Profiling and Performance Analysis Tools.
Responsibilities
- Designing and delivering the next generations of AI training, AI inferencing, virtual desktop, video and gaming infrastructure for Azure.
- Defining, deploying and sustaining hardware and software Azure infrastructure for AI and other GPU-based workloads.
- Focusing on hardware/software interaction, coding and working with next-gen hardware.
- Performing end-to-end systems engineering anywhere in the infrastructure, from fiber networking and switches to GPU differentiation, rack design, cluster design, and more.
- Producing extensible and maintainable code.
- Creating, implementing, optimizing, debugging, refactoring, and reusing code to establish and improve performance, maintainability, effectiveness, and return on investment (ROI).
Other
- Willing to dive deeply into any level or layer of a problem.
- Willing to learn emerging technologies, from hardware to software.
- Leads by example within the team by producing extensible and maintainable code.
- Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook, working on call to monitor the system/product/service for degradation, downtime, or interruptions, alerting stakeholders about status, and initiating actions to restore the system/product/service for simple and complex problems when appropriate.
- Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.