Develop AI software resiliency features for the most powerful AI supercomputers in the world, driving cluster downtime toward zero.
Requirements
- Proficiency in C++ and Python
- Strong understanding of distributed systems concepts, parallel programming, and fault tolerance in large-scale computing environments
- Familiarity with AI frameworks such as PyTorch, JAX/XLA, TensorFlow, or similar
- Experience with debugging and profiling tools (e.g., gdb, perf, valgrind, NVIDIA Nsight)
- Hands-on experience with CUDA, NCCL, or MPI for GPU-accelerated computing
- Knowledge of checkpointing strategies, error mitigation, or fault-tolerant computing in AI training
- Experience working with large-scale AI clusters, HPC environments, or cloud-based AI workloads
Responsibilities
- Develop AI Software Resiliency Features
- Hands-On Coding & Optimization
- Fault Tolerance & Debugging
- Collaborate Across Teams
- Testing & Automation
- Support Production Deployments
Other
- Pursuing or recently completed a Bachelor’s, Master’s, or PhD degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience
- Excellent problem-solving skills and the ability to work in a fast-paced, highly collaborative environment