
Engineering Manager, Machine Learning, Model Evaluations and Data Curation (AI Foundations)

Netflix

$190,000 - $920,000
Oct 6, 2025
Remote, US

Fast-paced innovation in large language models (LLMs) and generative AI is reshaping personalization and discovery at Netflix. Our current bet is that an in-house LLM, customized on Netflix data, will be the cornerstone of this transformation. Model Evaluations and Data Curation (“Evals & Data”) are central to the development of LLMs and other high-leverage foundation models at Netflix. Our Evals & Data team builds the benchmarks, evaluators, and baselines that guide LLM progress, along with the data infrastructure that delivers high-quality, reproducible training and evaluation datasets to our AI/ML researchers. Together, these capabilities create a flywheel of continuous improvement across data, evals, and modeling that drives the LLM innovation that will power personalization and discovery in the future. We are incubating a centralized, first-class evaluation discipline, creating a shared language, tools, and standards that enable application teams to measure progress consistently and with confidence, with an emphasis on Netflix’s LLM customized for personalization.

Requirements

  • Strong technical expertise in LLMs, their evaluation, and practical methods for ensuring robustness, reproducibility, and quality.
  • Broad knowledge of machine learning fundamentals and evaluation methodologies, including benchmark design, model-based evaluators, and offline/online metrics.
  • Experience building and leading high-performing teams of ML researchers and engineers.
  • Proven track record of leading machine learning initiatives from research to production, ideally involving evaluation frameworks, ML infrastructure, or data-intensive systems.
  • Experience with large-scale ML systems and foundation models, especially LLMs.
  • Background in building evaluation frameworks, model benchmarking, or data infrastructure for LLM training.
  • Familiarity with multi-modal data and evaluation.

Responsibilities

  • Partner with downstream AI application teams to define shared evaluations that codify application expectations of LLMs and other foundation models, ensuring progress can be transparently tracked against real-world needs.
  • Design rigorous benchmarks and evaluation methodologies across ranking & recommendations, content understanding, and language/text generation — grounded in a deep technical understanding of LLMs, their strengths, limitations, and failure modes.
  • Lead the development of evaluators and strong baselines to ensure in-house LLMs and other foundation models demonstrate clear advantages over off-the-shelf alternatives.
  • Build scalable, reproducible data and evaluation systems that make dataset creation and evaluation design as nimble and experiment-friendly as model development itself.
  • Work closely with the teams developing Netflix’s foundation models (including our core LLM) to ensure evaluation and data insights are folded back into the cadence of model development.
  • Proactively influence the ML Platform and Data Engineering teams at the key interfaces where evaluation and data workflows depend on their systems.

Other

  • Hire, grow, and nurture a world-class team, fostering an inclusive, high-performing culture that balances research innovation with engineering excellence.
  • Experience driving cross-functional projects, including close collaboration with AI application teams to translate product needs into evaluation frameworks.
  • Excellent written and verbal communication skills, with the ability to bridge technical and non-technical audiences.
  • Advanced degree in Computer Science, Statistics, or a related quantitative field.
  • 8+ years of overall experience, including 3+ years in engineering management.