The Azure Kubernetes Service (AKS) team needs to build world-class container management and orchestration services for the cloud and beyond, specifically focusing on enhancing Kubernetes to support artificial intelligence and machine learning workloads. This involves developing infrastructure controllers, onboarding new Kubernetes features, integrating with other Microsoft Azure services, and building foundational components to automate and accelerate training and inference workflows to meet the rapidly evolving AI landscape.
Requirements
- Experience coding in languages including, but not limited to, C, C++, Go, or Python.
- 5+ years of experience designing, building, shipping, and operating reliable distributed systems.
- 3+ years of hands-on experience with scalable infrastructure and fault-tolerant architectures.
- 3+ years of delivering production-ready solutions in cloud environments.
- 1+ year(s) of experience working with artificial intelligence and machine learning workloads, including integration of training and inference pipelines into distributed systems and optimization of compute resources for performance and reliability.
- 1+ year(s) of experience with container technologies and orchestration platforms, including use of containers such as Docker and deployment and management of workloads using Kubernetes.
Responsibilities
- Develop infrastructure controllers, onboard new Kubernetes features, integrate with other Microsoft Azure services, and build foundational components to automate and accelerate training and inference workflows.
- Play a key role in defining the next generation of cloud-native infrastructure on Microsoft Azure.
- Enhance Kubernetes to support artificial intelligence and machine learning workloads.
- Design and implement solutions that advance Azure Kubernetes Service for artificial intelligence (AI) workloads.
- Fully leverage AI in product development.
- Work with emerging technologies, from software to hardware.
- Drive features from idea to production.
Other
- Holds accountability as a Designated Responsible Individual (DRI), mentoring engineers across products/solutions, working on-call to monitor system/product/service for degradation, downtime, or interruptions.
- Maintains communication with, and provides clarity to, partners across the Microsoft ecosystem of engineers. Contributes to partners' success.
- Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
- Travel: 0-25%
- Work site: remote (0 days per week in office)