Wayve is seeking skilled engineers to join its Training Tech team and optimize large-scale training jobs. The team aims to scale Wayve's models by the next order of magnitude and to make training jobs more efficient, so that Wayve can train larger models faster.
Requirements
- Experience optimizing large-scale training jobs on GPU compute clusters.
- Experience working in platform teams and collaborating with research teams.
- Experience tracking benchmarked performance over time and reporting it in an open and accessible way.
- Ability to write high-quality, well-structured and tested Python code.
- Solid experience working with concurrent, parallel and distributed computing.
- Experience using NVIDIA Nsight Systems.
- Experience implementing GPU kernels.
Responsibilities
- Profile training jobs to identify their bottlenecks, e.g. using NVIDIA Nsight Systems.
- Design and implement efficiency improvements to maximise Model FLOPs Utilization (MFU), e.g. tensor parallelism, model compilation, mixed precision.
- Design and implement observability tools, e.g. to track MFU.
- Collaborate closely with Research teams to integrate training efficiency improvements and create a culture of performance optimization.
Other
- BS or MS in Machine Learning, Computer Science, Engineering, or a related technical discipline, or equivalent experience.
- Full-time role based in our office in Sunnyvale.
- Hybrid working policy that combines time together in our offices and workshops, to fuel innovation, culture, relationships and learning, with time spent working from home.
- We operate core working hours so you can determine the schedule that works best for you and your team.