Elastic, the Search AI Company, enables everyone to find the answers they need in real time, using all their data, at scale. To that end, we are building a conversational (agentic) platform that lets customers chat with their own data in Elasticsearch.
Requirements
- 3 to 5 years in applied DS or ML with production ownership, including at least 1 to 2 years focused on evaluating LLM or agent workflows in shipped systems
- Proven experience designing and running stepwise evaluations for agent pipelines: retrieval coverage, reranking quality, reasoning traces, tool selection accuracy, citation grounding, and final answer helpfulness and faithfulness
- Golden set hygiene: stratified dataset design, leakage controls, reviewer guidelines, inter-rater checks, and versioned labels
- Fluent with offline IR metrics and guardrails: Recall@k, nDCG, MRR, groundedness or citation support, plus latency and cost tracking; can move from offline gains to online A/B tests
- Practical Elasticsearch experience or a similar search system; ES|QL familiarity is a plus
Responsibilities
- Own well-scoped pieces of the offline and online evaluation pipeline for agent workflows: retrieval coverage, reranking quality, reasoning traces, tool selection accuracy, citation integrity, and final answer helpfulness and faithfulness
- Calibrate and validate LLM-as-judge rubrics against human labels, track judge-to-human agreement statistics, and add periodic checks to prevent drift
- Instrument agent runs with traces so you can localize errors to retrieval, reasoning, tool execution, or grounding, then contribute CI checks that block merges on regressions
- Translate evaluation readouts into product calls such as model choice, routing policy, tool gating thresholds, prompt and chunking updates, and agent customization for Elastic use cases
- Collaborate with backend engineers on contracts for ES|QL, citations, and telemetry schemas, and with PM and UX to land findings in shipped features
- Share outcomes through clear docs, notebooks, and PRs, and contribute utilities that make evaluation faster and more reproducible for the team
Other
- Strong written communication and async collaboration habits in a distributed team
- Competitive pay based on the work you do here and not your previous salary
- Health coverage for you and your family in many locations
- Ability to craft your calendar with flexible locations and schedules for many roles