Meta's AI training and inference infrastructure is growing exponentially to support ever-increasing AI use cases. This creates a dramatic scaling challenge that engineers face daily, and drives the need to build and evolve the network infrastructure that connects vast numbers of training accelerators, such as GPUs, together.
Requirements
- Experience with using communication libraries, such as MPI, NCCL, and UCX
- Experience with developing, evaluating and debugging host networking protocols such as RDMA
- Experience with triaging performance issues in complex scale-out distributed applications
- Understanding of AI training workloads and the demands they place on networks
- Understanding of RDMA congestion control mechanisms on InfiniBand (IB) and RoCE networks
- Experience with machine learning frameworks such as PyTorch and TensorFlow
- Experience in developing systems software in languages like C++
Responsibilities
- Be an active member of a multi-disciplinary team developing solutions for large-scale training systems
- Own the overall performance of the communication system, including performance benchmarking, monitoring, and troubleshooting of production issues
- Identify potential performance issues across the stack (communication libraries, RDMA transport, host networking, scheduling, and the network fabric), and develop and deploy innovative solutions to address them
Other
- Currently has, or is in the process of obtaining a Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
- BS/MS/PhD in relevant fields (EE, CS), with 2+ years work experience
- Individual compensation is determined by skills, qualifications, experience, and location
- Meta offers a compensation package that includes bonus, equity, and benefits
- Must be able to work from California if hired for this position