XPENG is looking for a Machine Learning Infrastructure Engineer to build and optimize its next-generation DataLoader and Dataset Management System, a core piece of AI infrastructure that powers the autonomous driving, robotics, and intelligent cockpit teams with large-scale data processing, model training, and inference acceleration.
Requirements
- 5+ years of experience in large-scale data processing or ML infrastructure.
- Proficient in Python with solid software engineering fundamentals, clean coding practices, and strong debugging skills.
- Hands-on experience with relational databases and NoSQL systems, including metadata and cache management; prior experience with large-scale VectorDB is highly desirable.
- Experience in at least one of the following areas:
  - Large-scale deep learning training or inference optimization focused on scalability and model acceleration (distributed training strategies, quantization, CUDA kernel development, and related optimizations).
  - Columnar storage formats (Parquet/ORC) and related ecosystems, including partitioning, compression, and vectorized I/O optimization.
  - Linux file-system and network I/O optimization for NFS, high-performance distributed file systems, and object storage.
  - Large-scale data loading frameworks (e.g., PyTorch DataLoader, Hugging Face Datasets).
Responsibilities
- Design, develop, and maintain high-performance DataLoader SDKs and Dataset Management Systems for multi-source, heterogeneous data (images, videos, point clouds, sensor streams, etc.).
- Optimize multi-threaded/multi-process data pipelines for minimal I/O latency and preprocessing overhead, supporting large-scale model training and inference workloads.
- Contribute to AI infrastructure projects beyond data loading, including:
  - Distributed training and inference optimization.
  - Custom operator development (CUDA kernels, TensorRT, ROCm) and hardware-specific acceleration for GPUs/TPUs.
  - Model optimization techniques such as pruning, quantization, distillation, sparsification, and mixed-precision training.
- Collaborate with algorithm and platform teams to translate business needs into scalable, production-grade solutions.
- Continuously identify and address performance bottlenecks across the AI training and inference stack.
Other
- Master’s degree in Computer Science, Software Engineering, or a related field, or equivalent experience.
- Strong communication skills and ability to work cross-functionally in fast-paced environments.
- Strong ability to learn quickly, adapt to new challenges, and proactively explore and adopt new technologies.
- Familiarity with the autonomous driving industry and enthusiasm for its challenges.
- Experience with distributed computing frameworks such as Apache Ray.