Organize the world's information, and generate and curate high-quality tokens for Gemini core model training.
Requirements
- In-depth experience with and knowledge of LLM training and/or agents.
- Strong publication record in top machine learning conferences (e.g., NeurIPS, CVPR, ICML, ICLR, ICCV, ECCV).
- Solid skills and experience in software engineering for ML.
- Expertise in one or more of the following areas of LLMs: Synthetic Data, Data Quality, Scaling Data.
Responsibilities
- Research and develop methods to create diverse, high-quality synthetic data; scale its creation through collaborations; and evaluate and improve its effectiveness through ablations in pretraining, post-training, and distillation.
- Research and develop methods to identify quality issues across the pretraining data corpus, innovate on fixes, and evaluate and improve their effectiveness through ablations before landing.
- Stay up-to-date with the latest advancements in LLM research.
Other
- PhD in Computer Science or related field.
- Excellent communication and teamwork skills.
- Passion for research and a desire to make a significant impact on pretraining data.