Khan Academy is looking for an ML Data Engineer to evolve their eval dataset tools to meet the growing platform needs of AI-based tutoring, ensuring reliable, well-structured datasets that reflect the diversity and nuance of real learners.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Data Engineering, or a related field.
- 5 years of Software Engineering experience, with 3+ of those years working with large ML datasets, especially those in open-source repositories such as Hugging Face.
- Strong programming skills in Go, Python, SQL, and at least one data pipeline framework (e.g., Airflow, Dagster, Prefect).
- Experience with data versioning tools (e.g., DVC, LakeFS) and cloud storage systems.
- Familiarity with machine learning workflows, from training data preparation to evaluation.
- Familiarity with the architecture and operation of large language models, and a nuanced understanding of their capabilities and limitations.
- Attention to detail and an obsession with data quality and reproducibility.
Responsibilities
- Evolve and maintain pipelines for transforming raw trace data into ML-ready datasets.
- Clean, normalize, and enrich data while preserving semantic meaning and consistency.
- Prepare and format datasets for human labeling, and integrate results into ML datasets.
- Develop and maintain scalable ETL pipelines using Airflow, dbt, Go, and Python, running on GCP.
- Implement automated tests and validation to detect data drift or labeling inconsistencies.
- Collaborate with AI engineers, platform developers, and product teams to define data strategies in support of continuously improving the quality of Khan’s AI-based tutoring.
- Contribute to shared tools and documentation for dataset management and AI evaluation.
Other
- Motivated by the Khan Academy mission “to provide a free world-class education for anyone, anywhere.”
- Proven cross-cultural competency skills demonstrating self-awareness, awareness of others, and the ability to adopt inclusive perspectives, attitudes, and behaviors to drive inclusion and belonging throughout the organization.
- Competitive salaries
- Ample paid time off as needed – Your well-being is a priority