Build generative models of the 3D world to power creative applications, visual reasoning, simulation, planning for embodied agents, and real-time interactive experiences.
Requirements
- Experience with large-scale transformer models and/or large-scale data pipelines.
- Exceptional engineering skills in Python and deep learning frameworks (e.g., JAX, TensorFlow, PyTorch), with a track record of building high-quality research prototypes and systems.
- Demonstrated experience in large-scale training of multimodal generative models.
- Experience building training codebases for large-scale video or multimodal transformers.
- Expertise optimizing efficiency of distributed training systems and/or inference systems.
- Strong background in 3D representations or 3D computer vision.
- Track record of releases, publications, and/or open source projects relating to video generation, world models, multimodal language models, or transformer architectures.
Responsibilities
- Conduct research to build generative multimodal models of the 3D world.
- Solve essential problems in training world models at massive scale.
- Develop metrics for spatial intelligence.
- Curate and annotate training data.
- Enable real-time interactive experiences.
- Explore downstream applications.
- Study the integration of spatial modalities with multimodal language models.
Other
- MSc or PhD in computer science or machine learning, or equivalent industry experience.
- A keen eye for visual aesthetics and detail, coupled with a passion for creating high-quality, visually compelling generative content.
- Strong publication record at top-tier machine learning, computer vision, and graphics conferences (e.g., NeurIPS, ICLR, ICML, SIGGRAPH, CVPR, ICCV).