1X is looking to build the systems that let every team and every robot go faster by enabling more frequent training, more reliable evaluation, and better model deployment to its growing fleet of humanoid robots.
Requirements
- Linux
- Python / C++
- PyTorch / TorchTitan / TensorRT
- Triton / CUDA
Responsibilities
- Build the systems that let every team and every robot go faster: training more often, evaluating more reliably, and deploying better models to our growing fleet.
- Transform prototypes into production-scale infrastructure for learning and inference, enabling larger training runs and maximizing edge compute utilization to make our models more capable.
- Take high agency and ownership in scaling our distributed training and/or inference capabilities
- Ensure that compute is never the bottleneck, i.e. we can always train on all the data we collect
- Enable large-scale (1,000+ GPU) training on billions of frames of robot data, from fault tolerance to distributed ops to experiment management (see the training sketch after this list)
- Optimize high-throughput, datacenter-scale distributed inference for world models: work on the world's fastest diffusion inference engine (a toy denoising loop is sketched below)
- Improve low-latency on-device inference for a variety of robot policies with quantization, scheduling, distillation, and more (see the quantization sketch below)
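
For illustration only, here is a minimal sketch of what fault-tolerant data-parallel training can look like in PyTorch: every worker resumes from the latest checkpoint, so failed or preempted nodes can rejoin a run without losing progress. PolicyNet, CKPT_PATH, the dummy objective, and the shared-filesystem assumption are hypothetical placeholders rather than 1X's actual stack; a real 1,000+ GPU run would add sharding, elastic rendezvous, and experiment tracking on top.

```python
# Minimal fault-tolerant DDP training sketch.
# Launch with: torchrun --nproc_per_node=8 train.py
# PolicyNet and CKPT_PATH are illustrative placeholders.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

CKPT_PATH = "checkpoint.pt"  # hypothetical path on a shared filesystem


class PolicyNet(torch.nn.Module):
    """Toy stand-in for a robot policy model."""

    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(256, 512), torch.nn.ReLU(), torch.nn.Linear(512, 32)
        )

    def forward(self, x):
        return self.net(x)


def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    device = rank % torch.cuda.device_count()
    torch.cuda.set_device(device)

    model = DDP(PolicyNet().to(device), device_ids=[device])
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    start_step = 0

    # Resume from the latest checkpoint so restarted workers rejoin
    # the run at the saved step instead of starting over.
    if os.path.exists(CKPT_PATH):
        ckpt = torch.load(CKPT_PATH, map_location=f"cuda:{device}")
        model.module.load_state_dict(ckpt["model"])
        opt.load_state_dict(ckpt["opt"])
        start_step = ckpt["step"] + 1

    for step in range(start_step, 10_000):
        x = torch.randn(64, 256, device=device)  # stand-in for a batch of robot data
        loss = model(x).pow(2).mean()  # dummy objective
        opt.zero_grad()
        loss.backward()
        opt.step()

        # Periodic checkpointing from rank 0 only.
        if rank == 0 and step % 500 == 0:
            torch.save(
                {"model": model.module.state_dict(), "opt": opt.state_dict(), "step": step},
                CKPT_PATH,
            )

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```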
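
As a rough illustration of the workload behind world-model inference, the toy loop below runs batched DDPM-style denoising. Denoiser, the linear beta schedule, and the step count are assumptions made for this sketch, not 1X's engine; a production diffusion inference engine would lean on fused kernels, CUDA graphs, cross-GPU batching and scheduling, and reduced-step samplers rather than this naive loop.

```python
# Toy batched DDPM-style sampler; Denoiser and the schedule are illustrative.
import torch


class Denoiser(torch.nn.Module):
    """Tiny epsilon-prediction network; a real world model would be far larger."""

    def __init__(self, dim=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim + 1, 256), torch.nn.SiLU(), torch.nn.Linear(256, dim)
        )

    def forward(self, x, t):
        t = t.expand(x.shape[0], 1)  # broadcast the timestep to the batch
        return self.net(torch.cat([x, t], dim=-1))


@torch.inference_mode()
def sample(model, batch=512, dim=64, steps=50, device="cpu"):
    betas = torch.linspace(1e-4, 0.02, steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(batch, dim, device=device)  # start from pure noise
    for t in reversed(range(steps)):
        t_in = torch.full((1, 1), t / steps, device=device)
        eps = model(x, t_in)  # predict the noise at this step
        # DDPM posterior mean update, with sigma_t^2 = beta_t noise.
        x = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x


print(sample(Denoiser()).shape)  # torch.Size([512, 64])
```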
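
And as one concrete example of the precision/latency trade-offs in on-device inference, the sketch below applies post-training dynamic quantization to a toy policy network and compares CPU latency against the fp32 baseline. PolicyMLP and the timing harness are illustrative assumptions, not a 1X model or deployment path (which could instead target TensorRT or static quantization).

```python
# Post-training dynamic quantization of a toy policy MLP (CPU-only sketch).
import time

import torch
import torch.nn as nn


class PolicyMLP(nn.Module):
    """Hypothetical stand-in for a small robot policy."""

    def __init__(self, obs_dim=256, act_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)


model = PolicyMLP().eval()

# Convert Linear weights to int8; activations are quantized on the fly.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

obs = torch.randn(1, 256)
with torch.inference_mode():
    for name, m in [("fp32", model), ("int8", quantized)]:
        start = time.perf_counter()
        for _ in range(1000):
            m(obs)
        print(f"{name}: {time.perf_counter() - start:.3f}s for 1000 single-obs calls")
```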
Other
- Target start date: Immediate.
- Relocation provided.
- Candidates are expected to work in person at the office.