AI Engineer - Evaluations

Hippocratic AI

Salary not specified

Oct 30, 2025

Palo Alto, CA, US

Hippocratic AI is developing a safety-focused Large Language Model (LLM) for healthcare, aiming to improve healthcare accessibility and health outcomes globally by bringing deep healthcare expertise to every human. The AI Engineer - Evaluations role defines and builds the systems that measure, validate, and improve the intelligence, safety, and empathy of the company's voice-based generative healthcare agents, so that each interaction is clinically safe and follows healthcare best practices.

Requirements

Proficiency in Python and experience building data pipelines, evaluation frameworks, or ML infrastructure.
Familiarity with LLM evaluation techniques — including prompt testing, multi-agent workflows, and tool-using systems.
Understanding of deep learning fundamentals and how offline datasets, evaluation data, and experiments drive model reliability.
Experience developing agent harnesses or simulation environments for model testing.
Familiarity with reinforcement learning, retrieval-augmented evaluation, or long-context model testing.
3+ years of software or ML engineering experience with a track record of shipping production systems end-to-end.

Responsibilities

Design and build evaluation frameworks and harnesses that measure the performance, safety, and trustworthiness of Hippocratic AI’s generative voice agents.
Prototype and deploy LLM-based evaluators to assess reasoning quality, empathy, factual correctness, and adherence to clinical safety standards.
Build feedback pipelines that connect evaluation signals directly to model improvement and retraining loops.
Develop reusable systems and tooling that enable contributions from across the company, steadily raising the quality bar for model behavior and user experience.
Define and build the systems that measure, validate, and improve the intelligence, safety, and empathy of our voice-based generative healthcare agents.
Design LLM-based auto-evaluators, agent harnesses, and feedback pipelines that ensure each model interaction is clinically safe, contextually aware, and grounded in healthcare best practices.
Work across the stack — from backend data pipelines and evaluation frameworks to tooling that surfaces insights for model iteration.

Other

Excellent communication skills with the ability to partner effectively across engineering, research, and clinical domains.
Passion for safety, quality, and real-world impact in AI-driven healthcare products.
Expected to be in the office five days a week in Palo Alto, CA, unless explicitly noted otherwise in the job description.