Together AI aims to build the fastest, most capable open-source-aligned LLMs and inference stack in the world by deeply understanding model behavior and building evaluation systems that ensure models behave intelligently and consistently in production.
Requirements
- Strong engineering skills with Python, evaluation tooling, and distributed workflows
- Experience working with LLMs or transformer-based models, particularly in model evaluation, testing, or red-teaming
- Ability to reason clearly about qualitative behavior, edge cases, and model failure patterns
- Experience designing experiments, building datasets, and interpreting noisy behavioral signals
- Understanding of function calling and structured output formats
- Familiarity with GPU or distributed compute environments
- Hands-on experience evaluating function-calling models, agentic systems, or tool-augmented LLM pipelines
Responsibilities
- Build and iterate on evaluation frameworks that measure model performance across instruction following, function calling, long-context reasoning, multi-turn dialog, safety, and agentic behaviors
- Develop specialized evaluation suites for function calling, agentic workflows, and tool-augmented interactions
- Create automated CI/CD pipelines for A/B comparisons, regression detection, behavioral drift monitoring, and adversarial probing
- Design and curate high-quality evaluation datasets, especially nuanced or challenging cases across domains
- Collaborate with researchers and engineers to diagnose failures, triage regressions, and guide data selection, shaping strategies, objective design, and system improvements
- Work with engineering teams to build dashboards, reports, and internal tools that help visualize behavior changes across releases
- Operate in a fast-paced, high-impact environment with deep technical ownership and close partnership with world-class model researchers and infra engineers
Other
- Passion for discovering subtle behaviors, surprising model gaps, or edge-case failures
- Ability to work in a fast-paced, high-impact environment
- Deep technical ownership and close partnership with world-class model researchers and infra engineers
- The US base salary range for this full-time position is $220,000 – $270,000, plus equity and benefits
- Startup equity, health insurance, and other benefits