The team is looking to improve the efficiency and stability of large-scale distributed training jobs in machine learning systems.
Requirements
- Familiarity with machine learning algorithms and platforms
- Familiarity with C/C++ and Python development in Linux environments
- Familiarity with at least one deep learning framework (TensorFlow, PyTorch, MXNet, or other)
- Familiarity with GPU-based high-performance computing and RDMA high-performance networking (MPI, NCCL, ibverbs)
- Familiarity with distributed training frameworks and optimizations such as DeepSpeed, FSDP, Megatron, and GSPMD
- Familiarity with AI compiler stacks such as torch.fx, XLA, and MLIR
Responsibilities
- Research and develop efficient machine learning systems
- Develop a state-of-the-art asynchronous training framework
- Implement general-purpose training framework features and model-specific optimizations
- Improve efficiency and stability for extremely large-scale distributed training jobs
Other
- Currently pursuing an MS in Software Development, Computer Science, Computer Engineering, or a related technical discipline
- Ability to work independently and complete projects from start to finish in a timely manner
- Good communication and teamwork skills, including the ability to explain technical concepts clearly to teammates
- Must obtain work authorization in country of employment at the time of hire, and maintain ongoing work authorization during employment