Netflix is looking to accelerate innovation across all product functions and decision-support needs. The Model Development & Management (MDM) team aims to maximize the impact of ML by building differentiated, highly scalable infrastructure that accelerates research and product iteration across use cases including recommendations, growth, studio, content understanding, and generative AI.
Requirements
- Experience leading teams responsible for building state‑of‑the‑art ML model development platforms that cover the full model development lifecycle.
- A track record of working on distributed ML infrastructure that spans laptop‑to‑cluster execution, supports multi‑node GPU training, and serves large‑scale models (recommenders, computer vision, LLMs, multimodal GenAI).
- Deep familiarity with containerization/orchestration, dependency and environment management (e.g., pinned specs, environment locks), and secure packaging practices for reliable, repeatable runs.
- Proficiency with ML frameworks and commercial ML/AI infrastructure, such as PyTorch, SageMaker, Ray, and Hugging Face.
- Strong ML infrastructure background (SDK/CLI design, packaging and environments, experiment tracking/lineage, observability).
- Experience managing a hybrid team with partners and team members distributed across U.S. geographies and time zones.
- A passion for translating the needs of ML practitioners into platform offerings with an emphasis on automation and self‑service capabilities.
Responsibilities
- Architect, build, test, and launch a cohesive SDK and set of opinionated templates that let practitioners scaffold projects, configure and execute runs (from laptop to tightly coupled multi-node GPU training), track experiments and lineage, package models with evaluation hooks, and promote them confidently.
- Partner with ML practitioners and adjacent pillars (Feature/Data, Training, Serving, Evaluation) to translate needs into a unified developer experience that hides infrastructure complexity while preserving expert control.
- Drive the strategy and vision of the Model Development SDK: own the portfolio of existing and new products, make build‑vs‑buy choices, and integrate libraries/frameworks into the unified platform.
- Build and execute a metrics‑led roadmap: define Developer Experience (DX) KPIs, plan incremental delivery and migrations, and demonstrate impact through adoption and reuse.
- Maintain and evolve current product offerings that are widely adopted both in OSS and internally (e.g., Metaflow).
- Design for extensibility as the space evolves, keep interfaces stable with clear deprecation policies, and prioritize measurable outcomes that lift practitioner velocity across Netflix.
- Operate cross-functionally with Training Platform and Offline Inference, Serving Systems, Feature/Data Infrastructure, and MLP Tooling to deliver a seamless, consistent experience end-to-end.
Other
- 10+ years of software engineering experience and 3+ years building and leading engineering teams.
- Communicate progress, milestones, and risks to stakeholders, customers, and senior leadership.
- Hire, grow, and coach a diverse team across Core Frameworks and User Experience pods (and incubate Exploratory Infra as needs emerge), fostering an inclusive, high‑ownership culture.
- Strong communication and collaboration skills, with the ability to build durable relationships with internal customers and external partners.
- Demonstrated ability to develop, drive, and execute a technical vision and roadmap.
- A track record of attracting top talent and growing a high‑performing, diverse team of tenured engineers to deliver results in a fast‑paced environment.
- Excellent product taste for developer experience, and the judgment to balance paved-path simplicity with power-user control.