Job Board



Senior Research Engineer, LLM Evaluation and Behavioral Analysis

Together AI

$220,000 - $270,000
Dec 10, 2025
San Francisco, CA, US

Together AI aims to build the fastest, most capable open-source-aligned LLMs and inference stack in the world. This role supports that goal by deeply understanding model behavior and building evaluation systems that ensure models behave intelligently and consistently in production.

Requirements

  • Strong engineering skills with Python, evaluation tooling, and distributed workflows
  • Experience working with LLMs or transformer-based models, particularly in model evaluation, testing, or red-teaming
  • Ability to reason clearly about qualitative behavior, edge cases, and model failure patterns
  • Experience designing experiments, building datasets, and interpreting noisy behavioral signals
  • Understanding of function calling and structured output formats
  • Familiarity with GPU or distributed compute environments
  • Hands-on experience evaluating function-calling models, agentic systems, or tool-augmented LLM pipelines

Responsibilities

  • Build and iterate on evaluation frameworks that measure model performance across instruction following, function calling, long-context reasoning, multi-turn dialog, safety, and agentic behaviors
  • Develop specialized evaluation suites for function calling, agentic workflows, and tool-augmented interactions
  • Create automated CI/CD pipelines for A/B comparisons, regression detection, behavioral drift monitoring, and adversarial probing
  • Design and curate high-quality evaluation datasets, especially nuanced or challenging cases across domains
  • Collaborate with researchers and engineers to diagnose failures, triage regressions, and guide data selection, shaping strategies, objective design, and system improvements
  • Work with engineering teams to build dashboards, reports, and internal tools that help visualize behavior changes across releases
  • Operate in a fast-paced, high-impact environment with deep technical ownership and close partnership with world-class model researchers and infra engineers

Other

  • Passion for discovering subtle behaviors, surprising model gaps, or edge-case failures
  • Ability to work in a fast-paced, high-impact environment
  • The US base salary range for this full-time position is $220,000 – $270,000, plus equity and benefits
  • Startup equity, health insurance, and other benefits