Microsoft is looking to design and deliver the next generations of AI training, AI inferencing, virtual desktop, video and gaming infrastructure for Azure, facing challenges across a wide spectrum of hardware architectures, network types and processor types. The goal is to define and drive an end-to-end vertical view, defining approaches and strategies across large organizations with continuous focus on customer value, quality, performance and automation.
Requirements
- 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C-Sharp, Java, JavaScript, or Python OR equivalent experience.
- 3+ years of experience developing tools or software systems for distributed cloud computing environments and/or HPC/AI infrastructure.
- 10+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C-Sharp, Java, JavaScript, or Python OR Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C-Sharp, Java, JavaScript, or Python OR equivalent experience.
- 1+ years of experience in data science, including work with machine learning models, statistical analysis, or data-driven product development, is strongly preferred.
Responsibilities
- Designing and delivering the next generations of AI training, AI inferencing, virtual desktop, video and gaming infrastructure for Azure.
- Defining and driving an end-to-end vertical view, defining approaches and strategies across large organizations.
- Defining, deploying and sustaining hardware and software Azure infrastructure for AI and other GPU-based workloads.
- Focusing on hardware/software interaction, coding and playing with next-gen hardware, and end-to-end systems engineering anywhere in the infrastructure - from fiber networking and switches to GPU differentiation, rack design, cluster design and more.
- Creates, implements, optimizes, debugs, refactors, and reuses code to establish and improve performance, maintainability, effectiveness, and return on investment (ROI).
- Acts as a Designated Responsible Individual (DRI) and guides other engineers by developing and following the playbook, working on call to monitor the system/product/service for degradation, downtime, or interruptions, alerting stakeholders about status, and initiating actions to restore the system/product/service for simple and complex problems when appropriate.
- Evaluate and make recommendations that advance Azure infrastructure for AI and other GPU-based workloads.
Other
- Willing to dive deeply into any level or layer of a problem.
- Willing to learn emerging technologies, from hardware to software.
- Leads by example within the team by producing extensible and maintainable code.
- Maintains communication with key partners across the Microsoft ecosystem of engineers.
- Ability to meet Microsoft, customer and/or government security screening requirements is required for this role.