Microsoft's Worldwide Fleet Resources Lifecycle Management team is looking to transform how the company manages its global hardware fleet, aiming to improve operational efficiency, reduce costs, and advance sustainability. This involves automating the verification, management, and delivery of new hardware for datacenters supporting services like Azure, HPC, Office, and Edge Computing, and enabling seamless capacity expansion for cloud services by integrating advanced hardware platforms.
Requirements
- Proven experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python.
- Experience in a technical role applying machine learning or mathematical optimization techniques to real-world problems.
- Experience building or integrating solutions using large language models (LLMs) or AI agents.
- Experience applying machine learning principles, including theoretical foundations such as algorithmic behavior, model architectures, optimization techniques, and statistical learning.
- Experience developing cloud-based solutions and implementing Machine Learning Operations (MLOps) strategies for model deployment, monitoring, and governance.
- Strong understanding of software engineering fundamentals including data structures, algorithms, testing methodologies, and design patterns.
Responsibilities
- Gathers requirements, designs solutions, and implements features that enable new technologies.
- Collaborates with stakeholders to identify user requirements, create design documents, and develop scalable systems and services.
- Utilizes strong software engineering fundamentals, including clean architecture, modular design, thorough testing, and peer reviews for reliable codebases.
- Develops and optimizes code to enhance performance, maintainability, effectiveness, and return on investment (ROI).
- Develops and deploys scalable AI-driven tools, algorithms, and machine learning (ML) models to enhance efficiency, reliability, and productivity.
- Serves as the Designated Responsible Individual (DRI) for monitoring, troubleshooting, and restoring production systems during on-call rotations.
- Leads live-site incident response, conducts root cause analysis, and implements long-term improvements to enhance system reliability and operational readiness.
Other
- Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
- Microsoft Cloud Background Check: The person in this position is required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
- Commitment to continuous personal and team development through a growth mindset.
- Works with product managers, engineers, and infrastructure teams to deliver impactful solutions.
- Follows organizational policies to ensure that security, privacy, safety, and accessibility standards are met.