Box needs to rigorously evaluate Large Language Models (LLMs) and Box AI Agents for enterprise-grade quality, reliability, and trust, so that organizations can change how they work with content and customers can transform their workflows.
Requirements
- 7+ years in machine learning or applied AI, including 2+ years managing engineers with a track record of coaching, hiring, and performance development.
- Practical experience evaluating and/or deploying ML systems at scale; you've designed metrics, datasets, or pipelines that informed product decisions.
- Strong analytical problem solver who works confidently with large, complex datasets and ambiguous problem spaces.
- Proficient in at least one programming language (e.g., Python, C++, Java, or R) and familiar with modern ML frameworks (e.g., PyTorch, TensorFlow, scikit-learn, NumPy, pandas).
- Experience with LLMs and retrieval-augmented generation (RAG).
- Depth in information retrieval (IR), natural language processing (NLP), or query understanding.
- Familiarity with Vertex AI and AWS Bedrock/SageMaker; exposure to Kubernetes-based systems.
Responsibilities
- Lead and mentor a team of ML engineers to design, build, and operationalize an evaluation framework for Box AI Agents and foundational LLMs.
- Define representative enterprise datasets and metrics; develop grading approaches that assess accuracy, safety, grounding, and usability.
- Pioneer and evangelize LLM evaluation methodology tailored to enterprise content management use cases.
- Collaborate with AI Platform teams to translate evaluation results into roadmap decisions and measurable agent improvements.
- Partner with model providers (e.g., OpenAI, Google, Anthropic) to share findings and influence model capabilities for enterprise needs.
- Track AI research and industry trends to continuously evolve our evaluation strategy and tooling.
- Manage and coordinate the team's on-call rotation to ensure timely and effective incident response; participate in escalated incidents to provide leadership and support; and address recurring issues to minimize disruptions.
Other
- We are an AI-first company. This means you approach your work with a growth mindset and find ways to leverage AI to help make faster, smarter decisions that will 10X your impact at Box.
- Excellent communicator who collaborates effectively across product, research, and platform teams and with external partners.
- Work with senior leadership (CEO, CTO) to set priorities and a clear one-year roadmap for Model Foundations.
- Preferred: A proven record of roadmap planning in which short-term wins ladder into a long-term vision.
- Boxers are expected to work from their assigned office a minimum of 3 days per week.