ByteDance is looking for PhD interns to contribute to the development of its AI foundation models, with a focus on distributed training, reinforcement learning, high-performance inference, and heterogeneous hardware compilation technologies.
Requirements
- Currently enrolled in a PhD program in distributed or parallel computing, with knowledge of recent advances in computing, storage, networking, and hardware technologies.
- Familiar with machine learning algorithms, platforms, and frameworks such as PyTorch and JAX.
- Have a basic understanding of how GPUs and/or ASICs work.
- Expert in at least one programming language in a Linux environment: C/C++, CUDA, or Python.
- Experience with GPU-based high-performance computing and RDMA high-performance networking (MPI, NCCL, ibverbs).
- Experience with distributed training framework optimizations such as DeepSpeed, FSDP, Megatron, and GSPMD.
- Experience with AI compiler stacks such as torch.fx, XLA, and MLIR.
Responsibilities
- Research and develop our machine learning systems, including heterogeneous computing architecture, management, scheduling, and monitoring.
- Drive cross-layer optimization across systems, AI algorithms, and hardware (GPU, ASIC) for machine learning.
- Implement both general-purpose training framework features and model-specific optimizations (e.g., LLMs, diffusion models).
- Improve the efficiency and stability of extremely large-scale distributed training jobs.
Other
- Currently enrolled in a PhD program.
- Must obtain work authorization in the country of employment at the time of hire and maintain ongoing work authorization during employment.
- Please state your availability clearly in your resume (start date, end date).