Data quality mindset: trace hygiene, metadata design, policy/PII awareness, and principled guardrails.
Experience building graders that score persona/tone, contract/formatting (e.g., JSON validity, schema), and tool‑use correctness.
Background with structured synthetic data generation and vendor annotation programs; familiarity with judge mutation/optimization loops.
AI & Technical Fluency: You don't need to train models, but you know how they work, how to test them, and how to build great products on top of them.
Responsibilities
Evaluation & Feedback Analysis
Convert multi‑source feedback (dogfood, VIP customers, production traces) into a prioritized dataset of 10–100 tasks per scenario, each with prompts and golden outputs; maintain a living failure taxonomy prioritized by volume × impact × fixability.
Build grader prompts (with few‑shots and counter‑examples) that achieve ≥80% human‑match rate, track TPR/TNR on held‑out sets, and prevent reward hacking.
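As a rough illustration of the grader metrics named above, the following minimal sketch computes human-match rate, TPR, and TNR from paired pass/fail labels on a held-out set. The function name `grader_agreement` and the boolean label format are assumptions made for this example, not part of the role's actual tooling.

```python
from typing import Dict, Sequence


def grader_agreement(human: Sequence[bool], grader: Sequence[bool]) -> Dict[str, float]:
    """Compare grader pass/fail verdicts with human labels (hypothetical helper)."""
    if len(human) != len(grader) or len(human) == 0:
        raise ValueError("label lists must be non-empty and the same length")

    pairs = list(zip(human, grader))
    # True positives: both human and grader say "pass".
    tp = sum(1 for h, g in pairs if h and g)
    # True negatives: both human and grader say "fail".
    tn = sum(1 for h, g in pairs if not h and not g)
    positives = sum(1 for h in human if h)
    negatives = len(human) - positives

    return {
        # Share of items where the grader agrees with the human label.
        "match_rate": (tp + tn) / len(human),
        # TPR: of the human-labeled passes, how many the grader also passed.
        "tpr": tp / positives if positives else float("nan"),
        # TNR: of the human-labeled fails, how many the grader also failed.
        "tnr": tn / negatives if negatives else float("nan"),
    }


# Illustrative labels only; not real evaluation data.
human_labels = [True, True, True, False, False]
grader_labels = [True, True, False, False, True]
metrics = grader_agreement(human_labels, grader_labels)
meets_bar = metrics["match_rate"] >= 0.80  # the ≥80% human-match target
```

Tracking TPR and TNR separately matters because a grader that passes nearly everything can still show a high match rate on a set dominated by passing examples.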
Synthetic & Human‑Labeled Data
Design structured tuples to scale high‑signal synthetic data; orchestrate vendor/partner annotation sprints and live calibrations to align shared judgment.
Ensure datasets are reproducible with linked artifacts and robust metadata/trace hygiene.
Customer‑Grounded Scenarios
Other
Doctorate in Data Science, Mathematics, Statistics, Econometrics, Economics, Operations Research, Computer Science, or a related field AND 1+ year(s) of data science experience.
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role.
Ability to work in a fast-paced, ambiguous environment and deliver results under tight deadlines.
2+ years of customer-facing, project-delivery, professional services, and/or consulting experience.
Strong communication and stakeholder management skills.