Our client is looking to reveal real risks in production AI systems by designing evaluation scenarios, datasets, and metrics.
Requirements
- Strong Python + SQL + data-wrangling skills
- Hands-on experience with evaluation design, sampling, and calibration
- Comfort with dashboards (Grafana, Power BI, or similar)
- Experience building golden datasets and structured evaluation traces
- Exposure to LLM or AI system evaluation (preferred)
- Experience in regulated industries (audit, finance, healthcare) is a plus
Responsibilities
- Design evaluation scenarios and metric frameworks to assess AI quality, suitability, reliability, and context-dependent behavior
- Build and maintain evaluation assets including datasets, golden traces, error taxonomies, and automated scoring/aggregation pipelines in partnership with engineering
- Develop and manage weekly reliability dashboards and automated reports, translating monitoring data into clear insights
- Analyze evaluation results to detect drift, outliers, context-driven failures, and calibration issues, validating evaluator reliability against human judgments (a minimal illustration follows this list)
- Document test logic, metric definitions, and interpretation guidance, and support context-engineering workflows with metrics for predictability, observability, and directability
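For illustration only, not part of the role description: a minimal sketch of the kind of evaluator-calibration check referenced above, comparing an automated judge's labels against human judgments using chance-corrected agreement. The labels and names below are hypothetical.

```python
# Hypothetical example: chance-corrected agreement between human raters and an automated evaluator.
from collections import Counter

def cohen_kappa(human: list[str], evaluator: list[str]) -> float:
    """Cohen's kappa between paired human and automated labels."""
    assert human and len(human) == len(evaluator), "need paired, non-empty labels"
    n = len(human)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(h == e for h, e in zip(human, evaluator)) / n
    # Expected agreement under independence, from each rater's label frequencies.
    h_freq, e_freq = Counter(human), Counter(evaluator)
    labels = set(h_freq) | set(e_freq)
    p_e = sum((h_freq[c] / n) * (e_freq[c] / n) for c in labels)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)

# Hypothetical golden-trace labels vs. an LLM judge's labels.
human_labels = ["pass", "fail", "pass", "pass", "fail", "pass"]
judge_labels = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(f"Cohen's kappa: {cohen_kappa(human_labels, judge_labels):.2f}")
```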
Other
- 3–6 years of experience
- Excellent communication, with the ability to turn technical data into decision-ready insights
- This is a contract role and does not offer health benefits
- Time Commitment: ~20 hours/week
- Location: Remote, in the U.S.