Microsoft Copilot is looking to build the best AI-powered products in the world and needs someone to architect and build the infrastructure that makes that possible: closing the gap between ML's potential and its messy reality in production.
Requirements
- 6 years of experience building and operating ML systems in production, with real stories about what breaks at scale and how you fixed it
- 5 years of experience in software engineering fundamentals, including distributed systems, containerization (Docker/Kubernetes), and cloud platforms (AWS/GCP/Azure)
- 5 years of hands-on experience with ML orchestration tools (Airflow, Kubeflow, Metaflow), experiment tracking, model registries, and feature stores
- 5 years of experience optimizing model inference, including improving GPU utilization and weighing the tradeoffs between latency, throughput, and cost
- Familiarity with LLM deployment patterns, vector databases, prompt management, and the unique challenges of serving foundation models
- Experience working with RAG, fine-tuning pipelines, or evaluation frameworks
Responsibilities
- Training pipelines that scale elegantly - Design and implement robust training infrastructure that handles everything from data ingestion to model versioning, making it trivial for ML engineers to experiment and deploy with confidence
- The data flywheel - Build the infrastructure and product features that capture user interactions, ground truth labels, and edge cases, then automatically route them back into training loops. Turn every production interaction into a training example
- Inference systems that deliver - Dive deep into model serving architecture: optimize latency, manage costs, implement intelligent caching, and build the observability needed to maintain reliability at scale
- Deployment pipelines with guardrails - Create deployment systems that balance velocity with safety: automated testing, gradual rollouts, performance monitoring, and quick rollback mechanisms
- Cross-functional infrastructure - Partner closely with ML engineers, platform engineers, and data scientists to build APIs and tools that enable tight, rapid feedback loops from production back to model development
Other
- Doctorate in Computer Science, Statistics, Software Engineering, or related field AND 3 years applied ML engineering experience
- OR Master's Degree in Computer Science, Statistics, Software Engineering, or related field AND 4 years applied ML engineering experience
- OR Bachelor's Degree in Computer Science, Data Engineering, Software Engineering, or related field AND 6 years applied ML experience
- Starting January 26, 2026, MAI employees are expected to work from a designated Microsoft office at least four days a week if they live within 50 miles (U.S.) or 25 miles (non-U.S., country-specific) of that location
- Desire to work at the intersection of teams, translating between ML researchers who want flexibility and engineers who need reliability