Microsoft's Azure Data engineering team is looking to build the data platform for the age of AI, powering a new class of data-first applications and driving a data culture. The Real-Time Intelligence (RTI) team within Microsoft Fabric is seeking a Principal Applied Scientist to lead the science of evaluating and improving LLM-powered agents operating on live operational data.
Requirements
- 2+ years designing and running ML/LLM evaluation and experimentation (offline metrics + online A/B tests).
- Proven experience applying machine learning, statistics, and measurement science to LLM and agent evaluation, ideally in real-time or streaming scenarios.
- Proficiency in agentic AI concepts (e.g., multi-step agents, tool orchestration, retrieval/RAG, workflow automation) and familiarity with techniques for assessing safety, robustness, anomaly detection, and causal impact of agent behaviors.
- Strong programming and modeling skills in languages such as Python, and experience building evaluation services or pipelines on distributed systems (e.g., running large-scale offline evals, auto-raters, or LLM-as-judge workloads).
- Ability to design, implement, and interpret rigorous evaluations end-to-end: constructing eval sets and scenarios, combining offline metrics with human/LLM raters, running online experiments (A/B tests, holdouts), and instrumenting reliability monitoring at scale (a minimal sketch of such an offline harness follows this list).
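To make the offline side of this concrete, here is a minimal sketch of an eval harness with an LLM-as-judge auto-rater that computes task success rate over an eval set. All names here (EvalCase, call_agent, llm_judge) are hypothetical placeholders for illustration, not an existing Fabric or RTI API.

```python
# Minimal offline eval harness sketch: run an agent over an eval set
# and score each answer with an LLM-as-judge auto-rater.
# EvalCase, call_agent, and llm_judge are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str      # scenario the agent must handle
    reference: str   # expected outcome the judge compares against

def run_offline_eval(
    cases: list[EvalCase],
    call_agent: Callable[[str], str],             # agent under test
    llm_judge: Callable[[str, str, str], bool],   # (prompt, answer, reference) -> pass/fail
) -> float:
    """Return the task success rate over the eval set."""
    passes = 0
    for case in cases:
        answer = call_agent(case.prompt)
        if llm_judge(case.prompt, answer, case.reference):
            passes += 1
    return passes / len(cases) if cases else 0.0
```

In practice the judge itself would be calibrated against human labels before its verdicts are trusted to gate a launch.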
Responsibilities
- Lead end-to-end science for evaluating LLM-powered agents on real-time and batch workloads: designing evaluation frameworks, metrics, and pipelines that capture planning quality, tool use, retrieval, safety, and end-user outcomes, and partnering with engineering for robust, low-latency deployment.
- Advance evaluation methodologies for agents across RTI surfaces by driving test set design, auto-raters (including LLM-as-judge), human-in-the-loop feedback loops, and measurable lifts in key quality metrics such as task success rate, reliability, and safety.
- Establish rigorous evaluation and reliability practices for LLM/agent systems: from offline benchmarks and scenario-based evals to online experiments and production monitoring (see the A/B-test sketch after this list), defining guardrails and policies that balance quality, cost, and latency at scale.
- Collaborate with PM, Engineering, and UX to translate evaluation insights into customer-visible improvements, shaping product requirements, de-risking launches, and iterating quickly based on telemetry, user feedback, and real-world failure modes.
- Provide technical leadership and mentorship within the applied science and engineering community, fostering inclusive, responsible-AI practices in agent evaluation, and influencing roadmap, platform investments, and cross-team evaluation strategy across Fabric.
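On the online side, the A/B tests mentioned above often reduce to comparing a success metric across a control and a treatment arm. Below is a minimal sketch, assuming task success is a per-session binary outcome; the counts are illustrative, not real data.

```python
# Minimal sketch of an online A/B check: a two-proportion z-test on
# task success rate between control (a) and treatment (b).
import math

def two_proportion_ztest(success_a: int, n_a: int,
                         success_b: int, n_b: int) -> float:
    """Return the z statistic for H0: p_a == p_b."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Illustrative counts: treatment lifts success rate from 78% to 82%
# over 5,000 sessions per arm.
z = two_proportion_ztest(3900, 5000, 4100, 5000)
print(f"z = {z:.2f}")  # z = 5.00; |z| > 1.96 is significant at the 5% level
```

A real deployment would also account for multiple metrics, sequential peeking, and guardrail metrics such as latency and cost alongside the primary success rate.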
Other
- Ability to meet Microsoft, customer and/or government security screening requirements is required for this role.
- This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
- Collaborative mindset with demonstrated success partnering across Engineering, PM, and UX to define quality bars, translate evaluation insights into roadmap decisions, and iterate quickly on customer-facing agent and LLM experiences.
- Embody our culture and values
- Microsoft is an equal opportunity employer.