ByteDance Networking is looking to solve the challenges of hyperscale data-center networking for popular applications like Douyin and TikTok, aiming for high availability, scalability, and high performance, particularly in support of AI/LLM applications.
Requirements
- Be familiar with network protocols such as TCP and RoCEv2, and have experience in network programming with the socket and verbs APIs;
- Be familiar with data-center congestion control algorithms and understand their trade-offs;
- Knowledge of scale-up protocols such as PCIe, NVLink, and UALink, and how they differ from scale-out network protocols;
- Be familiar with the latest advances in high-speed network systems, such as RDMA, congestion control, and AI network optimization;
- Proficiency in one or more mainstream programming languages, such as C/C++, Python, and Go;
- Have some knowledge of GPU architecture;
- Experience in developing high-performance communication frameworks (such as NCCL, MPI, and RPC libraries) is a plus.
Responsibilities
- Design, optimization, implementation and deployment of high-performance transport protocols to support AI/LLM applications.
- Design, optimization, implementation and deployment of congestion control algorithms to support AI/LLM applications.
- Research and development of high-performance AI communication frameworks and network protocol stacks, and host-network-application co-design to improve the scalability, reliability, and performance of AI/LLM networks.
- Follow the latest technologies from academia and industry, identify the innovative aspects of the system, and present them in academic papers.
Other
- Currently pursuing a PhD in computer networking or a related technical discipline.
- Be familiar with AI training/inference systems and software-hardware co-design.
- Have publications at top-tier networking and systems conferences such as NSDI, SIGCOMM, OSDI, and SOSP.