The company is looking to build and evolve internal evaluation frameworks for Generative AI systems, particularly Large Language Models, to help users make sense of complex observability data through AI-driven features.
Requirements
- Experience designing and implementing evaluation frameworks for AI/ML systems
- Familiarity with prompt engineering, structured output evaluation, and context-window management in LLM systems
- Ability to work with high autonomy, collaborating with teams to translate their goals into clear, testable criteria backed by effective tooling
- Experience working in environments with rapid iteration and experimental development
- Familiarity with CI/CD workflows and automated testing
Responsibilities
- Design and implement robust evaluation frameworks for GenAI and LLM-based systems
- Develop tooling to enable automated, low-friction evaluation of model outputs, prompts, and agent behaviors
- Define and refine metrics for both the structural and semantic quality of model outputs, ensuring alignment with realistic use cases and operational constraints (a minimal sketch of such a check follows this list)
- Lead the development of dataset management processes and guide teams across Grafana in best practices for GenAI evaluation
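For illustration only, here is a minimal sketch of the kind of structural-plus-semantic evaluation such tooling might perform. The function names (`evaluate_output`, `structural_score`, `semantic_score`), the required keys, and the token-overlap metric are hypothetical placeholders, not part of the role description; a production framework would likely swap in schema validation and an embedding-based or LLM-as-judge semantic metric.

```python
# Hypothetical sketch: a structural + semantic check for a single LLM output.
# All names and metrics here are illustrative assumptions, not a prescribed design.
import json
from typing import Any


def structural_score(raw_output: str, required_keys: set[str]) -> float:
    """Return 1.0 if the output parses as a JSON object containing every required key, else 0.0."""
    try:
        parsed: Any = json.loads(raw_output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(parsed, dict) and required_keys <= parsed.keys() else 0.0


def semantic_score(candidate: str, reference: str) -> float:
    """Cheap semantic proxy: Jaccard overlap of lowercased tokens."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    return len(cand & ref) / len(cand | ref) if cand | ref else 0.0


def evaluate_output(raw_output: str, reference: str, required_keys: set[str]) -> dict[str, float]:
    """Combine structural and semantic metrics for one model response."""
    return {
        "structure": structural_score(raw_output, required_keys),
        "semantics": semantic_score(raw_output, reference),
    }


if __name__ == "__main__":
    sample = '{"summary": "High error rate on the checkout service", "severity": "high"}'
    print(evaluate_output(sample, "checkout service error rate spike", {"summary", "severity"}))
```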
Other
- Passion for minimizing human toil and building AI systems that actively support engineers
- Pragmatic mindset that values reproducibility, developer experience, and thoughtful trade-offs when scaling GenAI systems
- Experience working in a remote environment (USA time zones only)