ByteDance's Applied Machine Learning (AML) team works to advance the next-generation AI infrastructure and recommendation platform for ads ranking, search ranking, live streaming, and e-commerce. The Tech Lead, AML Inference will oversee the development and execution of ByteDance's inference infrastructure, ensuring reliability, scalability, and performance across large-scale distributed systems.
Requirements
- 5+ years of experience developing and deploying large-scale distributed systems, with at least 5 years in a leadership or technical lead role.
- Strong programming skills in languages such as C++, Python, or Go.
- Deep understanding of inference frameworks and ML system deployment (e.g., TensorFlow, PyTorch, TensorRT, JAX, MXNet).
- Proven experience optimizing performance for large-scale machine learning systems, including hardware-software co-design, GPU/RDMA acceleration, or HPC techniques.
- Experience leading teams working on high-throughput, low-latency ML serving systems.
- Contributions to open-source ML or systems projects.
- Familiarity with container orchestration, service mesh, or cloud-native ML infrastructure.
Responsibilities
- Lead and mentor a team of inference-focused Machine Learning Engineers, setting technical direction and ensuring best practices.
- Drive the design and evolution of distributed inference infrastructure to support feeds, ads, search, and other core ranking models.
- Oversee the development of monitoring, observability, and management tools to ensure reliability and scalability of online inference services.
- Identify and resolve system inefficiencies, performance bottlenecks, and reliability issues, ensuring optimized end-to-end performance.
- Partner with research and product teams to translate requirements into robust and efficient inference solutions.
- Stay at the forefront of advancements in inference frameworks, ML hardware acceleration, and distributed systems, incorporating innovations where impactful.
Other
- Excellent communication and collaboration skills; ability to work across research, engineering, and product teams.
- Experience collaborating with and leading global, cross-functional teams across different time zones.