Hippocratic AI is developing a safety-focused Large Language Model (LLM) for healthcare, aiming to improve healthcare accessibility and health outcomes globally by bringing deep healthcare expertise to every human. The AI Engineer – Evaluations role is central to defining and building the systems that measure, validate, and improve the intelligence, safety, and empathy of the company's voice-based generative healthcare agents, ensuring that they are clinically safe and adhere to healthcare best practices.
Requirements
- Proficiency in Python and experience building data pipelines, evaluation frameworks, or ML infrastructure.
- Familiarity with LLM evaluation techniques, including prompt testing and the evaluation of multi-agent workflows and tool-using systems.
- Understanding of deep learning fundamentals and how offline datasets, evaluation data, and experiments drive model reliability.
- Experience developing agent harnesses or simulation environments for model testing.
- Familiarity with reinforcement learning, retrieval-augmented evaluation, or long-context model testing.
- 3+ years of software or ML engineering experience with a track record of shipping production systems end-to-end.
Responsibilities
- Design and build evaluation frameworks and harnesses that measure the performance, safety, and trustworthiness of Hippocratic AI’s generative voice agents.
- Prototype and deploy LLM-based evaluators to assess reasoning quality, empathy, factual correctness, and adherence to clinical safety standards.
- Build feedback pipelines that connect evaluation signals directly to model improvement and retraining loops.
- Develop reusable systems and tooling that enable contributions from across the company, steadily raising the quality bar for model behavior and user experience.
- Define and build the systems that measure, validate, and improve the intelligence, safety, and empathy of Hippocratic AI’s voice-based generative healthcare agents.
- Design LLM-based auto-evaluators, agent harnesses, and feedback pipelines that ensure each model interaction is clinically safe, contextually aware, and grounded in healthcare best practices.
- Work across the stack, from backend data pipelines and evaluation frameworks to tooling that surfaces insights for model iteration.
Other
- Excellent communication skills with the ability to partner effectively across engineering, research, and clinical domains.
- Passion for safety, quality, and real-world impact in AI-driven healthcare products.
- Expected to be in the office five days a week in Palo Alto, CA, unless explicitly noted otherwise in the job description.