Hippocratic AI is developing a safety-focused large language model (LLM) for healthcare, aiming to improve healthcare accessibility and health outcomes globally by bringing deep healthcare expertise to every human. The Applied Machine Learning Engineer – Evaluations will play a central role in measuring, understanding, and improving our voice-based generative AI healthcare agents, translating qualitative notions of quality into quantitative signals that guide model iteration and deployment.
Requirements
- 4+ years of experience in applied ML, ML engineering, or AI evaluation, with a focus on building and analyzing model pipelines.
- Strong skills in Python, with experience in data processing, experiment tracking, and model analysis frameworks (e.g., Weights & Biases, MLflow, Pandas).
- Familiarity with LLM evaluation methods, speech-to-text/text-to-speech models, or multimodal systems.
- Understanding of prompt engineering, model fine-tuning, and retrieval-augmented generation (RAG) techniques.
- Experience building human-in-the-loop evaluation systems or UX research tooling.
- Knowledge of visualization frameworks (e.g., Streamlit, Dash, React) for experiment inspection.
- Familiarity with speech or multimodal model evaluation, including latency, comprehension, and conversational flow metrics (a minimal scoring sketch follows this list).
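To make the evaluation skills above concrete, here is a minimal sketch of scoring a single voice-agent turn on latency and transcript fidelity. Everything in it (the `AgentTurn` fields, the similarity proxy, the latency budget) is an illustrative assumption, not Hippocratic AI's actual tooling; a production harness would likely use word error rate and richer conversational metrics.

```python
# Minimal sketch: score one voice-agent turn on latency and transcript fidelity.
# All names and thresholds are illustrative, not Hippocratic AI's tooling.
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class AgentTurn:
    reference_text: str    # what the agent was expected to say
    transcribed_text: str  # STT transcript of what it actually said
    latency_ms: float      # time from end of user speech to start of agent audio


def score_turn(turn: AgentTurn, max_latency_ms: float = 1500.0) -> dict:
    """Return per-turn metrics suitable for aggregation across an eval suite."""
    # Crude transcript-similarity proxy; swap in WER for real speech evals.
    similarity = SequenceMatcher(
        None, turn.reference_text.lower(), turn.transcribed_text.lower()
    ).ratio()
    return {
        "similarity": round(similarity, 3),
        "latency_ms": turn.latency_ms,
        "latency_ok": turn.latency_ms <= max_latency_ms,
    }


turn = AgentTurn(
    reference_text="Please take your medication with food.",
    transcribed_text="Please take your medication with food",
    latency_ms=820.0,
)
print(score_turn(turn))  # {'similarity': 0.987, 'latency_ms': 820.0, 'latency_ok': True}
```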
Responsibilities
- Design and implement evaluation harnesses for multimodal agent tasks, spanning speech, text, reasoning, and interaction flows.
- Build interactive visualization and analysis tools that help engineers, researchers, and clinicians inspect model and UX performance.
- Define, automate, and maintain continuous evaluation pipelines, ensuring regressions are caught early and model releases improve real-world quality (see the regression-gate sketch after this list).
- Collaborate with product and clinical teams to translate qualitative healthcare goals (e.g., empathy, clarity, compliance) into measurable metrics (an example rubric follows this list).
- Analyze evaluation data to uncover trends, propose improvements, and guide iterative prompt tuning and model fine-tuning.
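As a hedged illustration of the continuous-evaluation responsibility above, the sketch below gates a candidate release on aggregate metrics against a stored baseline. The metric names, the 2% tolerance, and the higher-is-better conventions are all assumptions made up for this example.

```python
# Illustrative regression gate for a continuous evaluation pipeline.
# Metric names, baseline values, and tolerances are invented for this sketch.
import sys

# Direction each metric should move: True means higher is better.
HIGHER_IS_BETTER = {
    "task_completion_rate": True,
    "empathy_score": True,
    "p95_latency_ms": False,
}
TOLERANCE = 0.02  # allow up to 2% relative regression before failing


def regressions(baseline: dict, candidate: dict) -> list[str]:
    """Return human-readable descriptions of metrics that regressed."""
    failures = []
    for metric, higher_better in HIGHER_IS_BETTER.items():
        base, cand = baseline[metric], candidate[metric]
        delta = (cand - base) / base  # relative change vs. baseline
        regressed = delta < -TOLERANCE if higher_better else delta > TOLERANCE
        if regressed:
            failures.append(f"{metric}: {base:.3f} -> {cand:.3f} ({delta:+.1%})")
    return failures


baseline = {"task_completion_rate": 0.91, "empathy_score": 4.2, "p95_latency_ms": 1400.0}
candidate = {"task_completion_rate": 0.92, "empathy_score": 4.1, "p95_latency_ms": 1650.0}

failed = regressions(baseline, candidate)
for line in failed:
    print("REGRESSION:", line)
sys.exit(1 if failed else 0)  # nonzero exit blocks the release in CI
```

Wiring a check like this into CI is what lets regressions surface before a release rather than after.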
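For the metric-translation bullet, one common pattern (assumed here, not a description of Hippocratic AI's process) is to pin qualitative goals such as empathy and clarity to anchored rubric scales, so that human or LLM judges produce comparable numeric scores that can be averaged and tracked over time:

```python
# Sketch: anchored rubrics that turn qualitative goals into 1-5 scores.
# The dimensions and anchor wording are illustrative only.
RUBRICS = {
    "empathy": {
        1: "Dismissive or cold toward the patient's concern.",
        3: "Acknowledges the concern, but in generic language.",
        5: "Names the emotion, validates it, and tailors the response.",
    },
    "clarity": {
        1: "Jargon-heavy; a layperson could not act on the answer.",
        3: "Mostly plain language with some unexplained terms.",
        5: "Plain language, short sentences, explicit next steps.",
    },
}


def mean_scores(judgments: list[dict]) -> dict:
    """Aggregate per-call judge scores (human or LLM) into suite-level means."""
    return {dim: sum(j[dim] for j in judgments) / len(judgments) for dim in RUBRICS}


# Three judged calls, scored against the rubrics above.
print(mean_scores([
    {"empathy": 5, "clarity": 4},
    {"empathy": 4, "clarity": 5},
    {"empathy": 3, "clarity": 4},
]))  # {'empathy': 4.0, 'clarity': 4.33...}
```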
Other
- Comfortable collaborating with cross-functional partners across research, product, and design teams.
- Deep interest in AI safety, healthcare reliability, and creating measurable systems for model quality.
- We value in-person teamwork and believe the best ideas happen together. Our team is expected to be in the office five days a week in Palo Alto, CA, unless explicitly noted otherwise in the job description.
- If you’re passionate about understanding how AI behaves, measuring it rigorously, and helping shape the next generation of clinically safe, empathetic voice agents, we’d love to hear from you.