The team is developing novel measurements for the quality of machine-generated dialog, including LLM judges for aspects such as groundedness, tone, style, and safety. They need to track dataset composition, the accuracy of the LLM judges, and human expert review results in a central, visual representation.
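For illustration only, here is a minimal sketch of the kind of tracking described above: computing per-aspect agreement between an LLM judge and human experts with pandas. The table layout and column names ("aspect", "judge_label", "expert_label") are placeholder assumptions, not an existing schema.

```python
import pandas as pd

# Toy example: one row per judged dialog item and aspect, with the LLM
# judge's label and the human expert's label (hypothetical data).
judgments = pd.DataFrame({
    "aspect":       ["groundedness", "groundedness", "tone", "safety"],
    "judge_label":  ["pass", "fail", "pass", "pass"],
    "expert_label": ["pass", "pass", "pass", "fail"],
})

# Per-aspect agreement rate between the LLM judge and the experts; the
# resulting series is the kind of summary a dashboard would display.
judge_accuracy = (
    judgments
    .assign(match=lambda df: df["judge_label"] == df["expert_label"])
    .groupby("aspect")["match"]
    .mean()
    .rename("judge_accuracy")
)
print(judge_accuracy)
```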
Requirements
- 3+ years in data science and/or data engineering (Iceberg, pandas, Python, Tableau or equivalent, data collection and visualization)
- Good understanding of metrics, crowd science, annotation analysis, statistics
- Good engineering practices for building sustainable, easy-to-use metric-reporting pipelines
- Familiarity with computational linguistics and language quality is a plus
Responsibilities
- Build an easy-to-use dashboard for our datasets (requires integration with other teams)
- Build dashboards to visualize the status of dataset composition, judge accuracy, and human review results
- Help define useful metrics
- Build tools to facilitate processing of human review results: inter-annotator agreement, storage, and selection of the most useful data points for human expert review (see the sketch after this list)
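As an illustrative sketch of the inter-annotator-agreement tooling mentioned above, the snippet below computes pairwise Cohen's kappa from a table of human review results. The table layout and column names ("item_id", "annotator", "label") are assumptions, and scikit-learn's cohen_kappa_score is used for the kappa computation; the actual review storage format would come from the team's pipelines.

```python
from itertools import combinations

import pandas as pd
from sklearn.metrics import cohen_kappa_score

def pairwise_kappa(reviews: pd.DataFrame) -> pd.DataFrame:
    """Cohen's kappa for every pair of annotators on the items they share."""
    # One row per item, one column per annotator, cell values are labels.
    pivot = reviews.pivot_table(index="item_id", columns="annotator",
                                values="label", aggfunc="first")
    rows = []
    for a, b in combinations(pivot.columns, 2):
        both = pivot[[a, b]].dropna()  # items labeled by both annotators
        if len(both) >= 2:
            rows.append({"annotator_a": a, "annotator_b": b,
                         "n_items": len(both),
                         "kappa": cohen_kappa_score(both[a], both[b])})
    return pd.DataFrame(rows)

# Toy usage with hypothetical review results:
reviews = pd.DataFrame({
    "item_id":   [1, 1, 2, 2, 3, 3],
    "annotator": ["a1", "a2", "a1", "a2", "a1", "a2"],
    "label":     ["grounded", "grounded", "not_grounded", "grounded",
                  "grounded", "grounded"],
})
print(pairwise_kappa(reviews))
```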
Other
- Ability to work independently and cross-functionally to integrate with partner teams' reporting systems and pipelines
- Excellent communication skills and the ability to thrive in a highly collaborative work environment