XPENG is looking to develop the core brain for its end-to-end autonomous driving systems by creating a next-generation Vision-Language-Action (VLA) Foundation Model.
Requirements
- Experience in multi-modal modeling (vision, language, or planning), with a deep understanding of representation learning, temporal modeling, and reinforcement learning techniques.
- Strong proficiency in PyTorch and modern transformer-based model design.
- Prior experience building foundation or end-to-end driving models, or LLM/VLM architectures (e.g., ViT, Flamingo, BEVFormer, RT-2, or GRPO-style policies).
- Knowledge of RLHF/DPO/GRPO, trajectory prediction, or policy learning for control tasks.
- Familiarity with distributed training (DDP, FSDP) and large-batch optimization.
Responsibilities
- Research, design, and implement large-scale multi-modal architectures (e.g., vision–language–action transformers) for end-to-end autonomous driving.
- Design and integrate cross-modal alignment techniques (e.g., visual grounding, temporal reasoning, policy distillation, imitation and reinforcement learning) to improve model interpretability and action quality.
- Collaborate closely with researchers and engineers across the modeling and infrastructure teams.
- Contribute to publications at top-tier AI/CV/ML conferences and present research findings.
Other
- Currently enrolled in a Master's or Ph.D. program in Computer Science, Electrical/Computer Engineering, or a related field, with a specialization in CV/NLP/ML.
- Publication record in top-tier AI conferences (CVPR, ICCV, ECCV, NeurIPS, ICLR, ICML, etc.).