Senior Software Engineer

Microsoft

$119,800 - $258,000
Oct 31, 2025
Remote, US

The Microsoft Azure High Performance Computing & AI Engineering (HPC & AI Eng) team manages the core platform and fleet of AI High Performance Computing products that customers use to run their most performant and demanding workloads. Within it, the AI Customer Experience (AICE) engineering team is on the front lines, managing the flagship supercomputers used by top-tier AI customers. These systems enable breakthroughs such as ChatGPT and are highlighted in the Top500, MLPerf, and Graph500 rankings. Operating at supercomputing scale requires specialized tools and techniques to ensure system reliability, runtime performance, and job health while continuing to meet customer Service Level Agreements (SLAs).

Requirements

  • Experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python.
  • 3+ years of experience operating AI/HPC systems, developing and running AI/HPC applications on clusters, or operating cloud infrastructure.
  • 2+ years of specialized experience in at least one of the following: AI/HPC system management, high-speed networks, HPC storage, or cloud infrastructure management.
  • Experience running and troubleshooting machine learning workloads on Graphics Processing Unit (GPU)-based High Performance Computing (HPC) systems, including familiarity with the HPC software stack.
  • Experience with cloud computing, virtualization, and container technologies.

Responsibilities

  • Diagnose and troubleshoot the largest-scale supercomputing systems across the infrastructure stack (GPU hardware, networking, datacenter, and software).
  • Develop and apply advanced tools.
  • Identify operational gaps.
  • Implement features that support the smooth operation of cloud-native supercomputers.
  • Develop and follow the playbook.
  • Work on call to monitor the system, product, or service for degradation, downtime, or interruptions.
  • Proactively seek new knowledge and adapt to new trends, technical solutions, and patterns that improve the availability, reliability, efficiency, observability, and performance of products, while also driving consistency in monitoring and operations at scale.

Other

  • The ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
  • Collaborates with appropriate stakeholders to determine user requirements for a scenario.
  • Drives identification of dependencies and the development of design documents for a product, application, service, or platform.
  • Leverages subject-matter expertise of product features and partners with appropriate stakeholders (e.g., project managers) to drive a workgroup's project plans, release plans, and work items.