XPENG is seeking an engineer to optimize model inference and deploy high-performance, large-scale AI models for autonomous driving and beyond.
Requirements
- Strong coding skills in C++ and Python with a focus on performance and scalability.
- Proficiency in deploying deep learning models with TensorRT, ONNX Runtime, or TVM.
- Familiarity with CUDA programming and parallel computing principles.
- Solid understanding of model inference workflows and system-level performance tuning.
- Experience in quantization-aware training or post-training quantization.
- Hands-on experience with deploying vision-language or large multimodal models.
- Familiarity with low-precision inference (INT8/FP16), kernel fusion, and operator-level optimization.
Responsibilities
- Optimize large-scale multimodal models for low-latency inference and efficient memory usage across diverse hardware platforms.
- Apply state-of-the-art model compression techniques, including quantization (e.g., INT8/FP16), pruning, and knowledge distillation.
- Develop and integrate custom inference kernels targeting GPUs or custom AI accelerators.
- Build profiling tools and performance models to analyze bottlenecks and guide optimization strategies.
- Contribute to real-world deployment efforts in autonomous driving systems, including on-vehicle testing and iteration.
- Track the latest research in efficient ML inference and integrate relevant techniques into production pipelines.
Other
- Master’s or Ph.D. in Computer Science, Electrical Engineering, or a related field.
- Effective communicator and collaborative team player.
- Track record of open-source contributions or publications in ML/AI conferences (e.g., NeurIPS, ICML, CVPR).
- Background in system profiling, latency modeling, or compiler-level optimization.