XPENG is looking for a Machine Learning Infrastructure Engineer to build and optimize its next-generation DataLoader and Dataset Management System, a core piece of AI infrastructure that powers the autonomous driving, robotics, and intelligent cockpit teams with large-scale data processing, model training, and inference acceleration.
Requirements
- 5+ years of experience in large-scale data processing or ML infrastructure.
- Proficient in Python with solid software engineering fundamentals, clean coding practices, and strong debugging skills.
- Hands-on experience with relational databases and NoSQL systems, including metadata and cache management; prior experience with large-scale VectorDB is highly desirable.
- Experience in at least one of the following areas:
  - Large-scale deep learning training or inference optimization focused on scalability and model acceleration (distributed training strategies, quantization, CUDA kernel development, and related optimizations).
  - Columnar storage formats (Parquet/ORC) and related ecosystems, including partitioning, compression, and vectorized I/O optimization.
  - Linux file-system and network I/O optimization for NFS, high-performance distributed file systems, and object storage.
  - Large-scale data loading frameworks (e.g., PyTorch DataLoader, Hugging Face Datasets).
Responsibilities
- Design, develop, and maintain high-performance DataLoader SDKs and Dataset Management Systems for multi-source, heterogeneous data (images, videos, point clouds, sensor streams, etc.).
- Optimize multi-threaded/multi-process data pipelines for minimal I/O latency and preprocessing overhead, supporting large-scale model training and inference workloads.
- Contribute to AI infrastructure projects beyond data loading, including:
  - Distributed training and inference optimization.
  - Custom operator development (CUDA kernels, TensorRT, ROCm) and hardware-specific acceleration for GPUs/TPUs.
  - Model optimization techniques such as pruning, quantization, distillation, sparsification, and mixed-precision training.
- Collaborate with algorithm and platform teams to translate business needs into scalable, production-grade solutions.
- Continuously identify and address performance bottlenecks across the AI training and inference stack.
Other
- Master’s degree in Computer Science, Software Engineering, or a related field, or equivalent experience.
- Strong communication skills and ability to work cross-functionally in fast-paced environments.
- Strong ability to learn quickly, adapt to new challenges, and proactively explore and adopt new technologies.
- Familiarity with the autonomous driving industry and enthusiasm for its challenges.
- Experience with distributed computing frameworks such as Apache Ray.