Rivian is looking to establish a state-of-the-art ML infrastructure for training and inference of large autonomous driving models, and to optimize their performance.
Requirements
- Deep knowledge of PyTorch
- Knowledge of model training frameworks (e.g. PyTorch Lightning, Ray)
- In-depth knowledge of the transformer architecture and of ways to accelerate training and inference of transformer models
- Experience with large-scale distributed training of models
- A track record of profiling models and doing detective work to improve model training and inference speed
- Experience writing custom ops in CUDA or Triton
- Knowledge of NVIDIA TensorRT
Responsibilities
- Optimize the performance of large-scale deep learning training workloads on NVIDIA GPU systems
- Optimize the latency of model inference and model pre- and post-processing on onboard systems
- Design, train, and deploy large deep learning models that can leverage vast amounts of labeled and unlabeled data
Other
- PhD in CS/CE/EE, or equivalent industry experience
- A track record of efficiently solving complex problems collaboratively on large teams
- Experience with edge computing systems