ByteDance's Seed Vision Team focuses on foundational models for visual generation, aiming to develop multimodal generative models and conduct leading research that solves fundamental computer vision challenges in GenAI.
Requirements
- Research experience in multimodal learning, large-scale pretraining, or vision-language modeling.
- Proficiency in deep learning frameworks such as PyTorch or JAX.
- Demonstrated ability to conduct independent research, with publications in top-tier conferences such as CVPR, ICCV, ECCV, NeurIPS, ICML, or ICLR.
- Experience with autoregressive LLM training, especially in multimodal or unified modeling settings.
- Familiarity with instruction tuning, vision-language generation, or unified token space design.
- Background in model scaling, efficient training, or data mixture strategies.
- Ability to work closely with infrastructure teams to deploy large-scale training workflows.
Responsibilities
- Conduct research on joint training of vision, language, and video models under a unified architecture.
- Develop scalable and efficient methods for autoregressive-style multimodal pretraining, supporting both understanding and generation.
- Explore cross-modal tokenization, alignment, and shared representation strategies.
- Investigate instruction tuning, captioning, and open-ended generation capabilities across modalities.
- Contribute to system-level improvements in data curation, model optimization, and evaluation pipelines.
- Research and develop foundational models for visual generation (images and videos), ensuring high interactivity and controllability in visual generation, understanding patterns in video, and exploring visual-oriented tasks built on these generative foundational models.
Other
- Currently pursuing a PhD in Computer Vision, Machine Learning, NLP, or a related field.
- Please state your availability clearly in your resume (start date and end date).