Fast-paced innovation in large language models (LLMs) and generative AI is reshaping personalization and discovery at Netflix. Our current bet is that an in-house LLM, customized on Netflix data, will be the cornerstone of this transformation. Model Evaluations and Data Curation (“Evals & Data”) are central to the development of LLMs and other high-leverage foundation models at Netflix. Our Evals & Data team builds the benchmarks, evaluators, and baselines that guide LLM progress, as well as the data infrastructure that delivers high-quality, reproducible training and evaluation datasets to our AI/ML researchers. Together, these capabilities create a flywheel of continuous improvement across data, evals, and modeling, driving innovation in the LLMs that will power personalization and discovery in the future. We are incubating a centralized, first-class evaluation discipline, creating the shared language, tools, and standards that enable application teams to measure progress consistently and with confidence, with an emphasis on Netflix’s LLM customized for personalization.
Requirements
- Strong technical expertise in LLMs, their evaluation, and practical methods for ensuring robustness, reproducibility, and quality.
- Broad knowledge of machine learning fundamentals and evaluation methodologies, including benchmark design, model-based evaluators, and offline/online metrics.
- Experience building and leading high-performing teams of ML researchers and engineers.
- Proven track record of leading machine learning initiatives from research to production, ideally involving evaluation frameworks, ML infrastructure, or data-intensive systems.
- Experience with large-scale ML systems and foundation models, especially LLMs.
- Background in building evaluation frameworks, model benchmarking, or data infrastructure for LLM training.
- Familiarity with multi-modal data and evaluation.
Responsibilities
- Partner with downstream AI application teams to define shared evaluations that codify application expectations of LLMs and other foundation models, ensuring progress can be transparently tracked against real-world needs.
- Design rigorous benchmarks and evaluation methodologies across ranking & recommendations, content understanding, and language/text generation, grounded in a deep technical understanding of LLMs and their strengths, limitations, and failure modes.
- Lead the development of evaluators and strong baselines to ensure in-house LLMs and other foundation models demonstrate clear advantages over off-the-shelf alternatives.
- Build scalable, reproducible data and evaluation systems that make dataset creation and evaluation design as nimble and experiment-friendly as model development itself.
- Work closely with the teams developing Netflix’s foundation models (including our core LLM) to ensure evaluation and data insights are folded back into the cadence of model development.
- Proactively influence the ML Platform and Data Engineering teams at the key interfaces between their systems and our evaluation and data infrastructure.
- Hire, grow, and nurture a world-class team, fostering an inclusive, high-performing culture that balances research innovation with engineering excellence.
Other
- Experience driving cross-functional projects, including close collaboration with AI application teams to translate product needs into evaluation frameworks.
- Excellent written and verbal communication skills, with the ability to bridge technical and non-technical audiences.
- Advanced degree in Computer Science, Statistics, or a related quantitative field.
- 8+ years of overall experience, including 3+ years in engineering management.