C3 AI is looking to build a bespoke, next-generation research platform for training novel, large-scale foundation models that goes beyond conventional LLM recipes. The goal is to empower researchers with robust orchestration, secure data pathways, and a frictionless developer experience, enabling fast, secure experimentation and scaling of complex training jobs on heterogeneous GPU clusters.
Requirements
- Deep expertise with Kubernetes and/or SLURM on GPU clusters, including proficiency with containers, images, and multi-node scheduling.
- Strong software development skills in Python and one of Go, C++, or Rust.
- Comfortable developing controllers/operators, high-performance services, and CLI tooling on Linux.
- Practical, hands-on knowledge of distributed ML frameworks (PyTorch DDP/FSDP, DeepSpeed ZeRO, or JAX on TPUs) and of performance profiling (NCCL, CUDA basics, I/O performance); a minimal DDP sketch appears after this list.
- Experience with object stores, the Parquet format, dataset version control, streaming/sharding techniques, and efficient artifact management for checkpoints and logs; see the shard-streaming sketch after this list.
- Practical experience with observability (Prometheus/Grafana/OpenTelemetry) and infra-as-code (Terraform/Helm/Ansible).
- Experience with high-speed networking and storage, including InfiniBand/RDMA, GPUDirect RDMA, NVLink topology, and high-throughput file/object systems.
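To make the distributed-framework requirement concrete, here is a minimal sketch of the kind of multi-node PyTorch DDP entry point this platform would launch at scale. It assumes a `torchrun`-style launcher that sets `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`; the model and hyperparameters are placeholders.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun (or the cluster's launcher) injects RANK, LOCAL_RANK, and
    # WORLD_SIZE into each worker's environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    # Placeholder model and data; a real job builds the actual architecture here.
    model = DDP(torch.nn.Linear(4096, 4096).to(device), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    inputs = torch.randn(8, 4096, device=device)

    # One step; DDP all-reduces gradients over NCCL during backward().
    loss = model(inputs).pow(2).mean()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, e.g., `torchrun --nnodes=2 --nproc_per_node=8 --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 train.py`, this is the shape of workload the scheduling, NCCL tuning, and profiling skills above serve.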
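A second sketch, for the data-plane requirements: streaming Parquet shards out of an object store without materializing them in memory, using `pyarrow` and `s3fs`. The bucket, prefix, and `process` consumer are hypothetical stand-ins.

```python
import pyarrow.parquet as pq
import s3fs


def process(batch):
    # Stand-in for the real consumer (tokenizer, packer, shard writer, ...).
    print(batch.num_rows, "rows")


# Hypothetical bucket/prefix; any fsspec-compatible object store works the same way.
fs = s3fs.S3FileSystem()
shard_paths = sorted(fs.glob("my-bucket/corpus-v3/shard-*.parquet"))

for path in shard_paths:
    with fs.open(path, "rb") as f:
        pf = pq.ParquetFile(f)
        # Stream record batches rather than loading whole shards at once.
        for batch in pf.iter_batches(batch_size=10_000, columns=["text"]):
            process(batch)
```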
Responsibilities
- Design and manage the core research compute cluster, including node layouts, queues/partitions, preemption/fair-share policies, and multi-tenant isolation.
- Implement secure access controls for all users and services across the cluster using Kubernetes and/or SLURM.
- Build robust branch-to-experiment CI/CD workflows, encompassing templated job creation, config promotion, and integrated version control.
- Implement an experiment and metrics tracking system (runs, configs, checkpoints, logs) with searchable lineage to enable frictionless cross-team collaboration and sharing; a minimal run-record sketch follows this list.
- Design and integrate auto-checkpointing, artifact retention, and rollout/rollback mechanisms; see the atomic-checkpoint sketch after this list.
- Stand up robust dataset registries, ensuring data lineage, versioning, and secure access.
- Implement sharding, streaming, and prefetch mechanisms for efficient access to TB-scale corpora, plus long-term archival with reproducible rehydration; see the prefetch sketch after this list.
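One minimal way the tracking responsibility could bottom out, sketched here as one JSON document per run carrying config, lineage, metrics, and checkpoint pointers. The field names and file-per-run layout are illustrative assumptions, not a prescribed schema (Python 3.10+ for the `str | None` syntax).

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class RunRecord:
    """Minimal experiment-tracking record: one JSON document per run."""
    config: dict
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    parent_run: str | None = None  # lineage: which run this was forked from
    started_at: float = field(default_factory=time.time)
    checkpoints: list[str] = field(default_factory=list)
    metrics: dict[str, float] = field(default_factory=dict)

    def save(self, registry: Path) -> None:
        # One file per run keeps the registry greppable and easy to index later.
        registry.mkdir(parents=True, exist_ok=True)
        (registry / f"{self.run_id}.json").write_text(json.dumps(asdict(self), indent=2))


run = RunRecord(config={"lr": 1e-4, "model": "baseline-1b"})
run.metrics["loss"] = 2.31
run.checkpoints.append("s3://my-bucket/ckpts/step_1000.pt")
run.save(Path("./run_registry"))
```

A real system would index these records for search, but the `parent_run` field alone is already enough to reconstruct fork lineage.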
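Auto-checkpointing usually leans on a write-then-rename pattern so a preempted or crashed job never leaves a truncated checkpoint behind. A sketch, with `torch.save` standing in for whatever serializer the platform adopts:

```python
import os
import tempfile

import torch


def save_checkpoint_atomically(state: dict, path: str) -> None:
    """Write to a temp file, then rename: readers never see a partial checkpoint."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            torch.save(state, f)
            f.flush()
            os.fsync(f.fileno())  # force bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp_path)
        raise
```

A training loop would call it as, say, `save_checkpoint_atomically({"model": model.state_dict(), "step": step}, "/ckpts/step_01000.pt")`; a reader then sees either the previous checkpoint or the complete new one, never a partial file.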
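And a minimal sketch of rank-based sharding with bounded prefetch for the streaming responsibility above; `load_fn` is a hypothetical shard loader (for instance, the Parquet reader sketched under Requirements).

```python
import queue
import threading


def iter_shards_with_prefetch(shard_paths, rank, world_size, load_fn, depth=2):
    """Round-robin shards across ranks; prefetch the next `depth` in a thread."""
    mine = shard_paths[rank::world_size]  # static round-robin shard assignment
    buffer = queue.Queue(maxsize=depth)

    def producer():
        for path in mine:
            buffer.put(load_fn(path))  # blocks once `depth` shards are buffered
        buffer.put(None)               # sentinel: no more shards for this rank

    threading.Thread(target=producer, daemon=True).start()
    while (shard := buffer.get()) is not None:
        yield shard
```

Bounding the queue caps memory while hiding object-store latency behind compute, and the round-robin split keeps every rank's shard list disjoint and reproducible.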
Other
- BS/MS in Computer Science/Electrical Engineering or equivalent deep, practical experience.
- Proven track record building custom ML/HPC platforms for specialized research (e.g., novel model architectures, time-series, multimodal AI) where commercial cloud tools were insufficient.
- A pragmatic, product-focused approach to researcher ergonomics, demonstrated by platforms you have shipped that materially increased experiment throughput and velocity.