Apple is developing novel measurements of the quality of machine-generated dialog, including cutting-edge LLM judges for aspects such as groundedness, Siri tone and style, and safety. We need to track dataset composition, LLM-judge accuracy, and human expert review results in a central, visual representation.
Requirements
- 3+ years of experience in data science and/or data engineering (Apache Iceberg, Python with pandas, Tableau or equivalent, data collection and visualization)
- Good understanding of metrics, crowd science, annotation analysis, and statistics
- Strong engineering practices for building sustainable, easy-to-use metric-reporting pipelines
- Familiarity with computational linguistics and language quality is a plus
Responsibilities
- Build an easy-to-use dashboard for our datasets (requires integration with other teams)
- Build dashboards that visualize project status
- Help define useful metrics
- Build tools to facilitate processing of human review results (inter-annotator agreement, storage, selection of the most useful data points for human expert review)
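As a flavor of the responsibilities above, a minimal sketch of one inter-annotator agreement statistic (Cohen's kappa for two raters) is shown below. The label values and ratings are hypothetical examples, not data from this posting; in practice a library implementation such as scikit-learn's could be used instead.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each rater's marginal label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical safety annotations from two reviewers over the same five items.
ratings_a = ["safe", "safe", "unsafe", "safe", "unsafe"]
ratings_b = ["safe", "unsafe", "unsafe", "safe", "unsafe"]
print(round(cohens_kappa(ratings_a, ratings_b), 3))  # → 0.615
```

Raw percent agreement here is 0.8, but kappa discounts the agreement the two raters would reach by chance, which is why it is the standard reporting choice for annotation analysis.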
Other
- B.S., M.S., or Ph.D. in Computer Science, Data Science, or Data Engineering
- Ability to work independently and cross-functionally to integrate with partner teams' reporting systems and pipelines
- Excellent communication skills and the ability to thrive in a highly collaborative work environment