


AI Engineer - Evaluations

Hippocratic AI

Salary not specified
Oct 30, 2025
Palo Alto, CA, US

Hippocratic AI is developing a safety-focused Large Language Model (LLM) for healthcare, with the aim of improving healthcare accessibility and health outcomes globally by bringing deep healthcare expertise to every person. The AI Engineer – Evaluations role defines and builds the systems that measure, validate, and improve the intelligence, safety, and empathy of the company's voice-based generative healthcare agents, ensuring clinical safety and adherence to best practices.

Requirements

  • Proficiency in Python and experience building data pipelines, evaluation frameworks, or ML infrastructure.
  • Familiarity with LLM evaluation techniques — including prompt testing, multi-agent workflows, and tool-using systems.
  • Understanding of deep learning fundamentals and how offline datasets, evaluation data, and experiments drive model reliability.
  • Experience developing agent harnesses or simulation environments for model testing.
  • Familiarity with reinforcement learning, retrieval-augmented evaluation, or long-context model testing.
  • 3+ years of software or ML engineering experience with a track record of shipping production systems end-to-end.

Responsibilities

  • Design and build evaluation frameworks and harnesses that measure the performance, safety, and trustworthiness of Hippocratic AI’s generative voice agents.
  • Prototype and deploy LLM-based evaluators to assess reasoning quality, empathy, factual correctness, and adherence to clinical safety standards.
  • Build feedback pipelines that connect evaluation signals directly to model improvement and retraining loops.
  • Develop reusable systems and tooling that enable contributions from across the company, steadily raising the quality bar for model behavior and user experience.
  • Design agent harnesses and auto-evaluator pipelines that keep each model interaction clinically safe, contextually aware, and grounded in healthcare best practices.
  • Work across the stack — from backend data pipelines and evaluation frameworks to tooling that surfaces insights for model iteration.

Other

  • Excellent communication skills with the ability to partner effectively across engineering, research, and clinical domains.
  • Passion for safety, quality, and real-world impact in AI-driven healthcare products.
  • This role is expected to work in the office in Palo Alto, CA five days a week, unless explicitly noted otherwise.