The Microsoft Azure AI and HPC team needs systems engineers to support customers in deploying, monitoring, profiling, and debugging their applications on hyperscale cloud infrastructure, ensuring system reliability, runtime performance, and job health for cloud-native supercomputers.
Requirements
- Experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python.
- Experience operating AI/HPC systems, developing and running AI/HPC applications on clusters, or operating cloud infrastructure.
- Specialized experience in one of the following: AI/HPC system management, high-speed networks, HPC storage, or cloud infrastructure management.
- Experience running and troubleshooting machine learning workloads on Graphics Processing Unit (GPU)-based High Performance Computing (HPC) systems, including familiarity with the HPC software stack.
- Experience with cloud computing, virtualization, and container technologies.
Responsibilities
- Develop and apply advanced tools, identify operational gaps, and implement features that support the smooth operation of cloud-native supercomputers.
- Help establish best practices, influence architectural decisions, and contribute to the roadmap of key software and hardware components.
- Be part of a comprehensive systems management team focused on operational excellence and customer success.
- Build tools and analyze key system metrics and telemetry to proactively identify and debug HPC system issues.
- Partner with customers, vendors, and other teams within Azure to drive comprehensive solutions for operating world-class supercomputers in the public cloud environment.
- Help ensure the Azure platform delivers consistent performance, scales on demand, and is engineered to withstand the unparalleled computing demands of customer workloads.
- Contribute to a test-driven engineering culture that reduces regressions and bugs in production and sets a higher bar for infrastructure quality.
Other
- The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
- Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
- Microsoft’s mission is to empower every person and every organization on the planet to achieve more.
- As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals.
- Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.