The Seed Multimodal Interaction and World Model team is dedicated to developing models with human-level multimodal understanding and interaction capabilities. The team also aims to advance the exploration and development of multimodal assistant products.
Requirements
- Publications in top-tier venues such as CVPR, ECCV, ICCV, NeurIPS, ICLR, and ICML, or in other leading AI and ML conferences
- Strong research background in at least one of the following: generative modeling (e.g., diffusion models, transformers), multimodal learning, or representation learning
- Solid engineering and modeling skills, with experience building and training large-scale ML systems
- Experience in building or training models for both generative and discriminative tasks
- Familiarity with joint modeling strategies (e.g., multitask learning, contrastive alignment, autoregressive decoding for understanding)
- Background in video generation, vision-language pretraining, or instruction-conditioned generation
- Interest in long-context modeling, memory architectures, or world modeling tasks
Responsibilities
- Develop and evaluate unified modeling architectures for multimodal foundation models across vision, audio, and language
- Contribute to building a shared representation space that supports both generation and understanding tasks
- Explore architectural and optimization strategies to improve generalization across modalities and tasks
- Collaborate with researchers working on generation, reasoning, and world modeling to scale and adapt models for real-world scenarios
Other
- Currently pursuing a PhD in Software Development, Computer Science, Computer Engineering, or a related technical discipline
- Must obtain work authorization in the country of employment at the time of hire, and maintain ongoing work authorization during employment