Senior Research Engineer, LLM Evaluation and Behavioral Analysis

Together AI

$220,000 - $270,000

Dec 10, 2025

San Francisco, CA, US

Together AI is working to build the fastest, most capable open-source-aligned LLMs and inference stack in the world by deeply understanding model behavior and building evaluation systems that ensure models behave intelligently and consistently in production.

Requirements

Strong engineering skills with Python, evaluation tooling, and distributed workflows
Experience working with LLMs or transformer-based models, particularly in model evaluation, testing, or red-teaming
Ability to reason clearly about qualitative behavior, edge cases, and model failure patterns
Experience designing experiments, building datasets, and interpreting noisy behavioral signals
Understanding of function calling and structured output formats
Familiarity with GPU or distributed compute environments
Hands-on experience evaluating function-calling models, agentic systems, or tool-augmented LLM pipelines

Responsibilities

Build and iterate on evaluation frameworks that measure model performance across instruction following, function calling, long-context reasoning, multi-turn dialog, safety, and agentic behaviors
Develop specialized evaluation suites for function calling, agentic workflows, and tool-augmented interactions
Create automated CI/CD pipelines for A/B comparisons, regression detection, behavioral drift monitoring, and adversarial probing
Design and curate high-quality evaluation datasets, especially nuanced or challenging cases across domains
Collaborate with researchers and engineers to diagnose failures, triage regressions, and guide data selection, shaping strategies, objective design, and system improvements
Work with engineering teams to build dashboards, reports, and internal tools that help visualize behavior changes across releases
Operate in a fast-paced, high-impact environment with deep technical ownership and close partnership with world-class model researchers and infra engineers

Other

Passion for discovering subtle behaviors, surprising model gaps, or edge-case failures
US base salary range for this full-time position: $220,000 – $270,000, plus startup equity, health insurance, and other benefits