Revolutionizing the management and optimization of Microsoft's global fleet resources, automating hardware verification, management, and delivery to datacenters, and supporting the onboarding of new hardware into the Azure cloud.
Requirements
- 4+ years of technical engineering experience coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience.
- 1+ years of experience in designing, deploying, and scaling AI/ML solutions on cloud platforms (Azure, AWS, GCP).
- 1+ years of experience in robust production operations, reliability engineering, and lifecycle management of AI/ML systems using MLOps/LLMOps best practices.
- Proven experience in designing and implementing end-to-end AI/ML solutions, integrating them seamlessly into both new and existing products and services across the full technology stack.
- Demonstrated expertise in compliance requirements, data governance, security best practices, and responsible AI principles.
- Strong foundation in core machine learning principles and algorithms, along with knowledge of deep learning architectures, natural language processing (NLP), and generative AI techniques.
- Experience with ML frameworks and libraries (PyTorch, TensorFlow, scikit-learn, Keras) and multi-agent AI frameworks (AutoGen, LangChain).
Responsibilities
- Develops and deploys scalable AI-driven tools, algorithms, and ML models to improve efficiency, reliability, and productivity.
- Collaborates with data scientists and product teams to align solutions with business objectives and deliver measurable value.
- Optimizes AI/ML models for performance and ensures seamless production integration.
- Serves as the Designated Responsible Individual (DRI) for monitoring, troubleshooting, and restoring production systems during on-call rotations.
- Leads live-site incident response, conducts root cause analysis, and implements long-term improvements to enhance system reliability and operational readiness.
- Demonstrates a commitment to continuous learning, staying up to date with evolving technologies and best practices.
- Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns to improve product availability, reliability, efficiency, observability, and performance.
Other
- Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
- Microsoft Cloud Background Check: The person in this position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
- Excellent cross-functional and interpersonal skills, with the ability to articulate solutions clearly and effectively.
- Ability to balance competing demands and adapt to changing priorities.
- Demonstrates ownership and promotes a learning-oriented, inclusive team environment.