Principal Supercomputing Software Engineer

Microsoft

$137,600 - $294,000

Sep 12, 2025

Remote, US

The Microsoft Azure AI/HPC team enables customers to deploy, monitor, profile, and debug their applications on hyperscale cloud infrastructure. The goal is to maintain the reliability, runtime performance, and health of the system and its running jobs so that customer SLAs are met, supporting growth and innovation in AI and HPC in the cloud.

Requirements

Experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
5+ years of experience in operating AI/HPC systems, developing and running AI/HPC applications on clusters, or operating Cloud Infrastructure
3+ years of specialized experience in at least one of the following: AI/HPC system management, high-speed networks, HPC storage, or managing cloud infrastructure
Operational experience running large-scale HPC systems or infrastructure in cloud environments
Previous experience with running and troubleshooting machine learning workloads on GPU-based HPC systems
Expertise in Cloud Computing, Virtualization and Container Technologies
Familiarity with the HPC software stack

Responsibilities

Build and use state-of-the-art tools and techniques
Find operational gaps and instrument features to achieve the smooth operation of cloud-native supercomputers
Establish best practices
Drive architectural changes
Influence the roadmap of relevant software and hardware components
Analyze key system metrics and telemetry to proactively identify and debug HPC system issues
Build appropriate tooling, help develop processes, and ensure that solutions are responsive to emerging user needs

Other

The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Be part of a comprehensive systems management team focused on operational excellence and customer success
Partner with customers, vendors, and other teams within Azure to drive comprehensive solutions for operating world-class supercomputers in the public cloud environment
Foster a test-driven engineering culture to reduce regressions and bugs in production and set a higher bar for infrastructure quality