The company is developing and improving its machine learning systems and large-scale distributed training jobs.
Requirements
- Familiar with machine learning algorithms, platforms, and frameworks such as PyTorch and JAX
- Basic understanding of how GPUs and/or ASICs work
- Expert in at least one programming language in a Linux environment: C/C++, CUDA, or Python
- Preferred: GPU-based high-performance computing and RDMA high-performance networking (MPI, NCCL, ibverbs)
- Preferred: Distributed training framework optimizations such as DeepSpeed, FSDP, Megatron, and GSPMD (see the illustrative sketch after this list)
- Preferred: AI compiler stacks such as torch.fx, XLA, and MLIR
- Preferred: Large-scale data processing and parallel computing
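For illustration only (not part of the formal requirements): a minimal sketch of the kind of distributed-training code this role works with, assuming PyTorch FSDP and a `torchrun` launch. The model, sizes, and hyperparameters are placeholders.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # Assumes a launch via `torchrun`, which sets RANK, WORLD_SIZE, and LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; FSDP shards its parameters, gradients, and optimizer
    # state across all ranks in the process group.
    model = FSDP(torch.nn.Linear(1024, 1024).cuda())
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # One dummy training step on a synthetic batch.
    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

This could be launched on a single multi-GPU node with, for example, `torchrun --nproc_per_node=8 train_sketch.py` (the script name is hypothetical).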
Responsibilities
- Research and develop our machine learning systems, including architecture, management, scheduling, and monitoring
- Manage cross-layer optimization across systems, AI algorithms, and hardware for machine learning
- Improve the efficiency and stability of extremely large-scale distributed training jobs
Other
- Currently pursuing an MS in Software Development, Computer Science, Computer Engineering, or a related technical discipline
- Must obtain work authorization in the country of employment at the time of hire and maintain ongoing work authorization during employment