Mercor is partnering with a leading AI research lab to hire experienced Data Scientists who specialize in AI task evaluation and statistical analysis. In this role, you will conduct comprehensive failure analysis of AI agent performance on finance-sector tasks: identifying systemic patterns, diagnosing performance bottlenecks, and improving model evaluation frameworks.
Requirements
- Strong foundation in statistical analysis, hypothesis testing, and pattern recognition.
- Proficiency in Python (pandas, scipy, matplotlib/seaborn) or R for data analysis.
- Hands-on experience with exploratory data analysis (EDA) and feature interpretation.
- Understanding of AI/ML evaluation methodologies and LLM performance metrics.
- Fluency with Excel, SQL, and data visualization tools (e.g., Tableau, Looker).
- Experience with AI/ML model evaluation or quality assurance pipelines.
- Familiarity with benchmark datasets, failure mode analysis, and evaluation frameworks.
- 2–4 years of relevant professional experience in data science, analytics, or applied statistics.
Responsibilities
- Statistical Failure Analysis: Identify recurring patterns in AI agent failures across task components (prompts, rubrics, file types, tags, etc.).
- Root Cause Analysis: Determine whether issues stem from task design, rubric clarity, file complexity, or agent limitations.
- Dimensional Analysis: Examine performance variations across finance sub-domains, file structures, and evaluation criteria (for a flavor of this kind of breakdown, see the sketch after this list).
- Visualization & Reporting: Build dashboards and analytical reports that highlight edge cases, performance clusters, and opportunities for improvement.
- Framework Enhancement: Recommend refinements to rubric design, evaluation metrics, and task structures based on empirical findings.
- Stakeholder Communication: Present key insights to data labeling teams, ML engineers, and research collaborators.
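For a concrete flavor of the dimensional failure analysis described above, here is a minimal Python sketch (the language named in the requirements). The dataset, column names such as `sub_domain`, `file_type`, and `passed`, and the toy values are all hypothetical, chosen purely for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical evaluation log: one row per graded agent attempt.
# Columns (sub_domain, file_type, passed) are illustrative assumptions,
# not the project's actual schema.
df = pd.DataFrame({
    "sub_domain": ["equities", "equities", "credit", "credit",
                   "fp&a", "fp&a", "credit", "equities"],
    "file_type":  ["xlsx", "pdf", "xlsx", "csv", "pdf", "xlsx", "pdf", "csv"],
    "passed":     [1, 0, 0, 0, 1, 1, 0, 1],
})

# Failure rate per finance sub-domain: a first cut at spotting
# performance clusters along one task dimension.
failure_rates = (
    df.assign(failed=lambda d: 1 - d["passed"])
      .groupby("sub_domain")["failed"]
      .mean()
      .sort_values(ascending=False)
)
print(failure_rates)

# Chi-square test of independence: does pass/fail vary with sub-domain?
contingency = pd.crosstab(df["sub_domain"], df["passed"])
chi2, p_value, dof, _ = stats.chi2_contingency(contingency)
print(f"chi2={chi2:.2f}, p={p_value:.3f}, dof={dof}")
```

In practice the same breakdown would extend to rubric clarity, file complexity, and other task components, with visualization and reporting layered on top.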
Other
- Part-time, 20–25 hours/week
- Fully remote and asynchronous — work on your own time
- Duration: 1–2 months, with strong potential for extension
- Start Date: Immediate