The company is looking to bring distributed ML training research into production and create protocols for machine intelligence to flourish alongside human intelligence.
Requirements
- Experience with modern parallelisation frameworks for training (e.g. FSDP, Megatron-LM, DeepSpeed)
- Experience with frameworks for production-scale inference (e.g. ONNX Runtime, TensorRT, DeepSpeed-Inference, NVIDIA Triton, and TorchServe)
- Deep theoretical knowledge of deep learning or distributed systems
- Experience with common networking protocols (IP, TCP, UDP, HTTP) and communication backends (NCCL, Gloo, MPI)
- Experience with compiler design
- Strong systems programming experience (especially Rust)
- Meaningful exposure to decentralised communication, distributed consensus, blockchains
Responsibilities
- Productionise advanced ML parallelisation and verification frameworks
- Convert novel hybrid parallelisation and verification research into production code
Other
- Comfortable working in environments with a heavy research component
- Experience working in high-growth start-up or scale-up environments
- Autonomy & independence
- High performance & rejection of mediocrity
Benefits
- Fully remote work
- Relocation assistance
- 3-4 all-expenses-paid company retreats around the world per year
- Paid sick leave
- Private health, vision, and dental insurance