Microsoft's HPC/AI team is driving the creation of next-generation distributed AI supercomputers to deliver unmatched computational power, scalability, and reliability for AI breakthroughs. The team designs and builds advanced infrastructure for large-scale AI model training, aiming to redefine what AI can achieve.
Requirements
- coding in languages including, but not limited to, C, C++, Rust, or Python
- software design and development
- Distributed Systems
- coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
- High Performance Computing / Machine Learning middleware and Communication Runtime
- Hardware-Software co-design
- Profiling and Performance Analysis Tools
Responsibilities
- Design, develop, and optimize networking solutions tailored for large-scale AI training infrastructure.
- Benchmark, analyze, and enhance the scalability and reliability of networking systems to handle petabyte-scale data transfer.
- Debug and resolve complex networking issues in large-scale, high-performance environments.
- Create, implement, optimize, debug, refactor, and reuse code to establish and improve performance, maintainability, effectiveness, and return on investment (ROI).
- Proactively seek new knowledge and adapt to new AI trends, technical solutions, and patterns that improve availability, reliability, efficiency, observability, and performance.
- Develop next-generation network transport protocols.
- Build RDMA-based communication libraries that deliver ultra-low latency and high throughput.
Other
- Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
- Problem-solving skills, analytical capabilities, and attention to detail.
- Familiarity with high performance networking hardware/architecture.
- Microsoft is an equal opportunity employer.