The Chan Zuckerberg Initiative (CZI) needs to build shared tools and platforms for its AI/ML and Data Engineering Infrastructure organization to support a wide range of Research Scientists, Data Scientists, and Engineers. The AI Infrastructure Engineering team aims to enable AI Research teams to achieve their goals faster and more securely by leveraging technology to automate manual processes, optimize operations, and provide first-class support.
Requirements
- Proven proficiency in a systems language (C, C++, C#, Go, Rust, Java, Scala) and a scripting language (Python, PHP, Ruby).
- Expertise in cloud platforms (AWS, GCP, Azure) and hybrid environments, including on-premises and colocation hosting.
- Strong experience in AI/ML platform operation technologies (e.g. Slurm, SUNK, Run:ai, Kubeflow).
- Advanced skills in scaling and securing containerized applications on Kubernetes, including custom container development and CI/CD integration.
- Working knowledge of NVIDIA CUDA, custom AI/ML libraries, and Linux systems optimization and administration.
Responsibilities
- Lead the design and delivery of secure, scalable, and high-performance AI/ML compute infrastructure.
- Architect and implement containerized AI/ML platforms using Kubernetes for heterogeneous, distributed environments.
- Integrate on-premises High Performance Computing (HPC) and cloud-based AI platforms with GPU clusters to support pre-training, training, fine-tuning, and inference workflows.
- Define and execute systems integration strategies to maximize performance, scalability, and security for AI workloads.
- Enable research teams to effectively use AI platforms through best practices in lifecycle management and deployment.
- Solve complex challenges in scaling AI workflows and optimizing model training and inference pipelines.
Other
- BS/MS in Computer Science or related field, or equivalent experience, with 8+ years in coding and systems architecture/design across AI/ML and core infrastructure.
- This is a hybrid role requiring you to be onsite for at least 60% of the working month, approximately 3 days a week.
- Specific in-office days are at the hiring manager's discretion and will be communicated during the interview process.