Wayve is looking to shape how its foundation models for embodied AI are evaluated. The role involves developing an offline evaluation suite with robust, interpretable metrics across a range of tasks, and ensuring high-quality ground-truth data through human annotation. This work will directly influence the understanding, trust, and deployment of Wayve's AI models in real-world automated driving systems.
Requirements
- ~5+ years of relevant industry experience, including designing and implementing offline evaluation pipelines for ML models (vision, multimodal, or embodied AI)
- Strong software engineering skills, including Python, data processing, and ML tooling
- Hands-on experience with human annotation workflows (task design, QA, coordination), including working with internal and external annotation partners
- Deep understanding of metrics and benchmark design for evaluating complex model behavior, and a drive to build new ones
- Experience with foundation models (LLMs or VLMs) and their evaluation
- Background in autonomous systems, robotics, or embodied AI domains
- Contributions to public benchmarks, datasets, or evaluation frameworks (e.g., nuScenes, AV2, Ego4D, ALFRED)
Responsibilities
- Build and scale offline evaluation pipelines for embodied AI models
- Design and implement benchmarks and metrics across vision, language, and driving tasks
- Lead and coordinate human annotation workflows, including quality assurance and task design
- Collaborate cross-functionally with science, datasets, and engineering teams
- Analyze offline metrics and correlate them with online performance to inform deployment readiness
Other
- Ability to collaborate cross-functionally in fast-paced, high-ownership environments
- Relocation support with visa sponsorship
- Flexible working hours
- Hybrid working policy