Mercor improves frontier language models by providing the human intelligence essential to AI development. The Research Engineer will contribute directly to post-training and RLVR (reinforcement learning with verifiable rewards), synthetic data generation, and large-scale evaluation workflows that meaningfully shape these models.
Requirements
- Strong applied research background, with a focus on post-training and/or model evaluation.
- Strong coding proficiency and hands-on experience working with machine learning models.
- Strong understanding of data structures, algorithms, backend systems, and core engineering fundamentals.
- Familiarity with APIs, SQL/NoSQL databases, and cloud platforms.
- Ability to reason deeply about model behavior, experimental results, and data quality.
- Hands-on experience on an industry post-training team (highest priority).
- Experience training models or evaluating model performance.
Responsibilities
- Work on post-training and RLVR pipelines to understand how datasets, rewards, and training strategies impact model performance.
- Design and run reward-shaping experiments and algorithmic improvements (e.g., GRPO, DAPO) to improve LLM tool-use, agentic behavior, and real-world reasoning; a minimal GRPO sketch follows this list.
- Quantify data usability, quality, and performance uplift on key benchmarks.
- Build and maintain data generation and augmentation pipelines that scale with training needs.
- Create and refine rubrics, evaluators, and scoring frameworks that guide training and evaluation decisions (see the scoring sketch after this list).
- Build and operate LLM evaluation systems, benchmarks, and metrics at scale.
- Collaborate closely with AI researchers, applied AI teams, and experts producing training data.
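For candidates less familiar with GRPO, here is a minimal sketch of its core idea, group-relative advantage estimation: instead of a learned value baseline, each sampled completion is scored against its siblings from the same prompt. All names are illustrative, not Mercor's internal code, and rewards are assumed to come from a verifier (as in RLVR).

```python
# Minimal sketch of GRPO-style group-relative advantages (illustrative only).
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize per-completion rewards within each prompt's sample group.

    rewards: shape (num_prompts, group_size), one verifiable reward per
    sampled completion. GRPO uses the group mean as the baseline, so each
    completion's advantage is its reward relative to its siblings.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

if __name__ == "__main__":
    # Two prompts, four sampled completions each; rewards from a verifier
    # (1.0 = passed the check, 0.0 = failed).
    rewards = np.array([[1.0, 0.0, 1.0, 0.0],
                        [1.0, 1.0, 1.0, 0.0]])
    print(group_relative_advantages(rewards))
```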
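Similarly, the rubric and scoring work can be pictured as weighted criteria applied to a model response. The sketch below assumes simple programmatic checks for brevity; in practice criteria are often judged by an LLM grader, and every name here is hypothetical.

```python
# Minimal sketch of a rubric-based scorer (illustrative names and checks).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str
    weight: float
    check: Callable[[str], bool]  # True if the response satisfies the criterion

def score_response(response: str, rubric: list[Criterion]) -> float:
    """Weighted fraction of rubric criteria the response satisfies."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(response))
    return earned / total if total else 0.0

if __name__ == "__main__":
    rubric = [
        Criterion("cites a source", 1.0, lambda r: "http" in r),
        Criterion("states a final answer", 2.0, lambda r: "Answer:" in r),
    ]
    print(score_response("Answer: 42 (see http://example.com)", rubric))  # 1.0
```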
Other
- Operate in a fast-paced, experimental research environment with rapid iteration cycles and high ownership.
- Excitement to work in person in San Francisco five days a week (with optional remote Saturdays) and to thrive in a high-intensity environment.
- Publications at top-tier conferences (e.g., NeurIPS, ICML, ACL).
- Experience in synthetic data generation, LLM evaluations, or RL-style workflows.
- Work samples, artifacts, or code repositories demonstrating relevant skills.