ByteDance is looking to develop and maintain massively distributed ML training and inference systems and services, providing high-performance, highly reliable, and scalable systems for LLM/AIGC/AGI.
Requirements
- Excellent coding ability, solid foundation in data structures and basic algorithms, proficient in C/C++ or Python.
- Familiar with at least one mainstream machine learning framework (TensorFlow/PyTorch/Jax).
- Mastery of distributed systems principles, with experience in the design, development, and maintenance of large-scale distributed systems.
- Prior contributions to large-scale projects or highly influential papers in the field of large models.
- Familiar with NLP- and CV-related algorithms and technologies, and experienced in large model training and RL algorithms.
- Experience in one of the following fields: CUDA, RDMA, AI Infrastructure, HW/SW Co-Design, High-Performance Computing (CUTLASS, NCCL), ML Hardware Architecture (GPU, Accelerators, Networking), ML for Systems, and Distributed Storage.
- Demonstrated relevant technical experience through previous internships, work experience, coding competitions, or publications.
Responsibilities
- Responsible for the architecture design and development of large-scale machine learning systems, solving technical challenges such as high concurrency, high reliability, and high scalability.
- Covering various sub-directions of machine learning systems, including resource scheduling, model training, model inference, data management, and workflow orchestration.
- Responsible for the research and introduction of advanced technologies in machine learning systems, such as the latest hardware architecture, heterogeneous computing systems, and compiler-based optimization technologies.
- Working closely with the algorithm teams to jointly optimize algorithms and systems.
- Responsible for the machine learning system development of the company's large-scale models, researching new applications and solutions of related technologies in areas such as search, recommendation, advertising, content creation, conversation, and customer service.
- Meeting the growing demand for intelligent interaction from users, and comprehensively improving users' lifestyles and communication methods in the future world.
- Building a large-scale heterogeneous system that integrates GPU/NPU/RDMA/storage and keeping it running stably and reliably.
Other
- Final-year PhD student or recent PhD graduate with a background in Computer Science or a related technical field, or equivalent industrial research experience.
- Must obtain work authorization in the country of employment at the time of hire, and maintain ongoing work authorization during employment.
- Strong sense of responsibility, good learning ability, communication ability, and self-motivation.
- Good communication and collaboration skills, able to explore new technologies with the team and promote technological progress.
- Able to commit to an onboarding date by the end of 2026.