The Microsoft Azure Artificial Intelligence/High Performance Computing (AI/HPC) team is looking to enable customers to deploy, monitor, profile, and debug their applications on hyperscale cloud infrastructure.
Requirements
- Coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- 5+ years of experience in the system design space
- Deep knowledge of AI systems and architectures for both training and inference SKUs across multi-vendor and multi-generational GPUs and models
- Experience with reliability modeling, lifecycle modeling and analysis of workloads and GPUs, GPU planning, analytical design of systems, and workload assignment
- Deep knowledge of LLM modeling
- Experience with software-hardware codesign features
- Familiarity with hyperscale cloud infrastructure
Responsibilities
- Partners with appropriate stakeholders to determine user requirements for a set of scenarios
- Leads identification of dependencies and the development of design documents for a product, application, service, or platform
- Leads by example and mentors others to produce extensible and maintainable code used across products
- Leverages subject-matter expertise of cross-product features with appropriate stakeholders to drive multiple groups' project plans, release plans, and work items
- Holds accountability as a Designated Responsible Individual (DRI), mentoring engineers across products/solutions, working on-call to monitor system/product/service for degradation, downtime, or interruptions
- Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products
- Works on all aspects of inference and training systems, focusing on system design, data center planning, and modeling of workloads on multiple GPU SKUs
Other
- Bachelor's Degree in Computer Science or related technical field
- 6+ years of technical engineering experience
- Ability to meet Microsoft, customer and/or government security screening requirements
- Ability to work on-call to monitor system/product/service for degradation, downtime, or interruptions
- Must be able to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter