The Seed Multimodal Interaction and World Model team is dedicated to developing models with human-level multimodal understanding and interaction capabilities. The team also aims to advance the exploration and development of multimodal assistant products.
Requirements
- Currently pursuing a PhD in Computer Science, Computer Vision, Machine Learning, or a related technical field.
- Familiarity with multimodal modeling, world models, or foundation model pretraining.
- Strong coding skills and hands-on experience with PyTorch or JAX.
- Experience with large-scale distributed training frameworks and GPU/TPU compute stacks.
- Demonstrated research ability, with publications in top-tier conferences such as CVPR, ICCV, ECCV, NeurIPS, ICML, or ICLR.
- Experience working with transformer-based architectures, including dense and Mixture-of-Experts (MoE) models (an illustrative routing sketch follows this list).
- Understanding of scaling behavior in foundation models and how to analyze it.
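For candidates less familiar with the MoE requirement above, here is a minimal, self-contained sketch of top-k expert routing in a Mixture-of-Experts feed-forward layer, written in PyTorch. All names and dimensions (TopKMoE, d_model, n_experts, and so on) are hypothetical and chosen purely for illustration; they do not describe the team's actual models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Toy Mixture-of-Experts feed-forward layer with top-k token routing."""

    def __init__(self, d_model=64, d_hidden=256, n_experts=4, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # routing logits per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (batch, seq, d_model)
        logits = self.router(x)                     # (batch, seq, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)  # keep the k best experts per token
        weights = F.softmax(weights, dim=-1)        # normalize the selected experts' scores
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e          # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = TopKMoE()
    tokens = torch.randn(2, 8, 64)
    print(layer(tokens).shape)  # torch.Size([2, 8, 64])
```

The point the sketch illustrates is that each token is processed only by its k highest-scoring experts, which is how MoE models grow parameter count without a proportional increase in per-token compute.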
Responsibilities
- Contribute to research and engineering to advance world models and multimodal understanding, enhancing models' reasoning and generation capabilities.
- Design and prototype novel architectures that balance modeling performance, generalization, and efficiency.
- Help establish scaling laws and conduct systematic ablations to derive transferable insights across model families and tasks (an illustrative scaling-law fit appears below).
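As a rough illustration of the scaling-law work mentioned in the last responsibility, the sketch below fits a simple power law, L(N) ≈ a · N^(-alpha), to loss-versus-parameter-count measurements by linear regression in log-log space. The data points and the functional form are assumptions for illustration only, not results from the team.

```python
import numpy as np

# Hypothetical (parameter count, validation loss) pairs; the numbers are made up.
params = np.array([1e7, 1e8, 1e9, 1e10])
losses = np.array([4.2, 3.4, 2.8, 2.3])

# Fit L(N) ≈ a * N^(-alpha) as a line in log-log space: log L = log a - alpha * log N
slope, intercept = np.polyfit(np.log(params), np.log(losses), deg=1)
alpha, a = -slope, np.exp(intercept)

print(f"fitted exponent alpha ≈ {alpha:.3f}")
print(f"predicted loss at 1e11 params ≈ {a * 1e11 ** -alpha:.2f}")
```

In practice such fits are typically repeated across compute budgets, data scales, and model families to check that the derived exponents actually transfer.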
Other
- Applications will be reviewed on a rolling basis; we encourage you to apply early.
- Please clearly state your availability (start and end dates) in your resume.