
Staff Software Engineer - AI Infrastructure

XPeng Motors

$179,400 - $303,600
Aug 14, 2025
Santa Clara, CA, US

XPENG is looking for a Machine Learning Infrastructure Engineer to build and optimize its next-generation DataLoader and Dataset Management System, a core piece of AI infrastructure that powers the autonomous driving, robotics, and intelligent cockpit teams with large-scale data processing, model training, and inference acceleration.

Requirements

  • 5+ years of experience in large-scale data processing or ML infrastructure.
  • Proficient in Python with solid software engineering fundamentals, clean coding practices, and strong debugging skills.
  • Hands-on experience with relational databases and NoSQL systems, including metadata and cache management; prior experience with large-scale VectorDB is highly desirable.
  • Experience in at least one of the following areas:
      • Large-scale deep learning training or inference optimization focused on scalability and model acceleration (distributed training strategies, quantization, CUDA kernel development, and related optimizations).
      • Columnar storage formats (Parquet/ORC) and related ecosystems, including partitioning, compression, and vectorized I/O optimization.
      • Linux file system and network I/O optimization for NFS, high-performance distributed file systems, and object storage.
      • Large-scale data loading frameworks (PyTorch DataLoader, Hugging Face Datasets).

Responsibilities

  • Design, develop, and maintain high-performance DataLoader SDKs and Dataset Management Systems for multi-source, heterogeneous data (images, videos, point clouds, sensor streams, etc.).
  • Optimize multi-threaded/multi-process data pipelines for minimal I/O latency and preprocessing overhead, supporting large-scale model training and inference workloads.
  • Contribute to AI infrastructure projects beyond data loading, including:
      • Distributed training and inference optimization.
      • Custom operator development (CUDA kernels, TensorRT, ROCm) and hardware-specific acceleration for GPU/TPU.
      • Model optimization techniques such as pruning, quantization, distillation, sparsification, and mixed-precision training.
  • Collaborate with algorithm and platform teams to translate business needs into scalable, production-grade solutions.
  • Continuously identify and address performance bottlenecks across the AI training and inference stack.

Other

  • Master’s degree in Computer Science, Software Engineering, or equivalent experience.
  • Strong communication skills and ability to work cross-functionally in fast-paced environments.
  • Strong ability to learn quickly, adapt to new challenges, and proactively explore and adopt new technologies.
  • Familiarity with the autonomous driving industry and enthusiasm for its challenges.
  • Experience with distributed computing frameworks such as Apache Ray.