The Chan Zuckerberg Initiative is seeking a Senior Technical Program Manager to lead cross-functional programs that make its AI/ML and Data Infrastructure teams more effective. The role focuses on improving how internal teams access, use, and scale compute and platform resources. Its scope includes onboarding/offboarding workflows, access management systems, and infrastructure programs that support efficient, secure, and impactful research and development.
Requirements
- 7+ years of experience in technical program management or infrastructure-focused operations in complex engineering environments.
- Proven ability to manage large-scale technical programs across multiple stakeholders and teams.
- High-level understanding of machine learning workflows and model training pipelines, with the ability to translate infrastructure needs between research and engineering teams.
- Familiarity with on-prem/HPC and/or multi-cloud GPU infrastructure, and with orchestration tools and platforms such as Slurm, Run:AI, MLflow, W&B, or similar systems, is a huge plus.
Responsibilities
- Lead AI/ML infrastructure programs: Drive execution of technical initiatives across GPU scheduling, platform enablement, observability, or workload orchestration.
- Lead access and lifecycle workflows: Own the end-to-end experience for users accessing shared infrastructure resources, including onboarding, offboarding, documentation, and support processes.
- Coordinate infrastructure access requests: Manage intake and operational workflows for machine learning infrastructure access, including triage, tracking, and communication.
- Drive documentation systems: Own the structure, accuracy, and governance of internal documentation, onboarding guides, runbooks, and infrastructure wikis.
- Enhance visibility: Maintain and improve AI system dashboards and reporting systems for onboarding timelines, RFA volume, and infrastructure program milestones.
Other
- Strong organizational skills and experience leading cross-functional programs with tight timelines and multiple stakeholders.
- Excellent written and verbal communication skills, including the ability to align stakeholders at multiple levels.
- A passion for building efficient, secure, and inclusive systems to support cutting-edge science and research.
- This role is a hybrid position requiring you to be onsite for at least 60% of the working month, approximately 3 days a week, with specific in-office days determined by the team’s manager.