Labelbox develops critical infrastructure that powers breakthrough AI models at leading research labs and enterprises. We are building a team that can design, build, and productionize evaluation and post-training systems for frontier LLMs and multimodal models.
Requirements
- A strong foundation in AI and machine learning, backed by a Ph.D. or Master’s degree in Computer Science, Machine Learning, AI, or a related field (in-progress degrees are acceptable for intern positions).
- A deep understanding of frontier autoregressive and diffusion multimodal models, along with the human and synthetic data strategies needed to optimize them.
- Passion for and experience with LLM evaluation and benchmarking.
- Expertise in constructing, measuring, and refining high-quality training data.
- The ability to bridge research and application by interpreting new findings and translating them into functional prototypes.
- Proficiency in Python and experience with deep learning frameworks like PyTorch, JAX, or TensorFlow.
- A track record of publishing in top-tier AI/ML conferences (e.g., NeurIPS, ICML, ICLR, ACL, EMNLP, NAACL) and contributing to the broader research community.
Responsibilities
- Build and own evaluation and benchmark suites for reasoning, code, agents, long-context, and V/LLMs.
- Create post-training datasets at scale: design preference/critique pipelines (human + synthetic), and target hard failures surfaced by evals.
- Experiment with and prototype RLHF/RLAIF/RLVR/RM/DPO-style training loops to improve real-world task and agent performance.
- Land research in product: ship improvements into Labelbox workflows, services, and customer-facing evaluation/quality features; quantify impact with customer and internal metrics.
- Engage with customer research teams: run pilots, co-design benchmarks, and share practical findings through internal research reports, blog posts, talks, and published papers.
- Design, build, and productionize evaluation and post-training systems for frontier LLMs and multimodal models.
- Own continuous, high-quality evals and benchmarks (reasoning, code, agent/tool-use, long-context, vision-language, and others).
Other
- A Ph.D. or Master’s degree in Computer Science, Machine Learning, AI, or a related field (in-progress degrees are acceptable for intern positions).
- Exceptional communication and collaboration skills.
- Ability to work in a hybrid model with two days per week in the office, combining collaboration and flexibility.
- Ability to work in a fast-paced, high-intensity environment suited to ambitious individuals who thrive on ownership and quick decision-making.
- Ability to exercise caution and suspend or discontinue communications when encountering suspicious emails or interactions.