MLOps Lead, Central Technology

Chan Zuckerberg Initiative

$241,000 - $331,000

Oct 1, 2025

Redwood City, CA, US

The Chan Zuckerberg Initiative is looking to solve some of society’s toughest challenges, from eradicating disease and improving education to addressing the needs of local communities, by leveraging technology to help build an inclusive, just, and healthy future for everyone.

Requirements

BS, MS, or PhD degree in Computer Science or a related technical discipline or equivalent experience
7+ years of relevant coding and systems experience
5+ years of systems Architecture and Design experience, with a broad range of MLOps experience across Data Infrastructure and AI/ML platforms
Proven technical leadership in SRE and MLOps, along with either direct or indirect people management experience
Strong experience scaling containerized applications on Kubernetes or Mesos (Kubernetes preferred), including expertise with creating custom containers using secure AMIs and continuous deployment systems that integrate with Kubernetes or Mesos
Cloud Platform proficiency with Amazon Web Services (AWS), Google Cloud Platform (GCP), or Microsoft Azure, and experience with On-Prem and Colocation Service hosting environments
MLOps experience working with medium- to large-scale GPU clusters in Kubernetes (Kubeflow), HPC environments, or large-scale cloud-based ML deployments

Responsibilities

Provide technical MLOps leadership: manage and lead a team of MLOps Engineers in operating our heterogeneous AI training and inference systems, and collaborate on the design and build of our AI platform components.
Drive the application of MLOps and DevOps principles: apply them across our multiple platforms to ensure peak operational efficiency of our AI operations and the process automation necessary for a world-class, large-scale AI model training environment.
Instrumentation and Observation technical leadership: define the end-to-end metrics program for the MLOps team, including full proactive monitoring and alerting systems.
Facilitate model training through collaboration with our AI Researchers: work with them and the rest of the AI Infrastructure Eng team to make sure that the models we train and release to inference follow machine learning and deep learning best practices and, through code automation libraries, are fully resilient to restarts and checkpoint recoveries.
Continuous Optimization of our Kubernetes-based AI Lifecycle platform: apply our IaC-based practices, integrate our MLOps AI Lifecycle platform tooling, and combine the platform with our On-Prem HPC systems into a cohesive heterogeneous platform.
Collaboration on Data systems for our AI model training: work with our Data Infrastructure Eng team and the Science data teams on the end-to-end data usage that drives our AI model training.
Lead our MLOps team supporting our on-call rotation: focus on automation and proactive alerting to reduce on-call load and improve self-healing AI system operations.

Other

Must be onsite for at least 60% of the working month, approximately 3 days a week, with specific in-office days determined by the team’s manager.
Paid time off to volunteer at an organization of your choice.
Funding for select family-forming benefits.