Netflix is building a scalable, reliable machine learning platform to accelerate every ML practitioner at the company, with a focus on the batch-prediction layer and large-scale batch inference workloads.
Requirements
- Hands-on experience with ML engineering or production systems involving training or inference of deep-learning models.
- Proven track record of operating scalable infrastructure for ML workloads (batch or online).
- Proficiency in one or more modern backend languages (e.g., Python, Java, Scala).
- Production experience with containerization & orchestration (Docker, Kubernetes, ECS, etc.) and at least one major cloud provider (AWS preferred).
- Deep understanding of real-world ML development workflows, gained through close partnership with ML researchers or modeling engineers.
- Familiarity with cloud-based AI/ML services (e.g., SageMaker, Bedrock, Databricks, OpenAI, Vertex) or open-source stacks (Ray, Kubeflow, MLflow).
- Experience optimizing inference for large language models, computer-vision pipelines, or other foundation models (e.g., FSDP, tensor/pipeline parallelism, quantization, distillation).
Responsibilities
- Build developer-friendly APIs, SDKs, and CLIs that let researchers and engineers, experts and non-experts alike, submit and manage batch inference jobs with minimal effort, particularly in the content and media domain.
- Design, implement, and operate distributed services that package, schedule, execute, and monitor batch inference workflows at massive scale.
- Instrument the platform for reliability, debuggability, observability, and cost control; define SLOs and share an equitable on-call rotation.
- Foster a culture of engineering excellence through design reviews, mentorship, and candid, constructive feedback.
Other
- Excellent written and verbal communication skills.
- Comfortable collaborating with peers and partners distributed across US geographies and time zones.
- Commitment to operational best practices: observability, logging, incident response, and on-call excellence.
- Bachelor's, Master's, or Ph.D. degree in Computer Science or a related field (implied rather than explicitly stated).
- Full-time hourly employees accrue 35 days of paid time off annually, covering vacation, holidays, and sick time.