Was reading Shahul's excellent blog on aligning LLM judges with Human Experts, and came across Shreya's paper on EvalGen.

TL;DR:

Problem:
LLMs are increasingly used to evaluate other LLM outputs, but these LLM-based evaluators can be unreliable and require validation. Existing tools lack sufficient support for verifying the quality of LLM-generated evaluations, and users struggle to define evaluation metrics for custom tasks.

Proposed Solution:
EvalGen, a mixed-initiative interface that assists users in creating and validating LLM-based evaluations.

Workflow:
1. The LLM suggests evaluation criteria based on the prompt under test.
2. The LLM generates candidate assertions (code or LLM prompts) for each criterion.
3. Users grade a subset of LLM outputs, providing feedback.
4. EvalGen selects the assertions that best align with the user grades (see the sketch below).
5. A report card shows the alignment between the chosen assertions and the user grades.

Key Features:
- Criteria Generation: LLM-powered suggestions for evaluation criteria.
- Assertion Synthesis: LLM-generated candidate implementations (code or LLM prompts).
- Active Learning: User grades guide the selection of aligned assertions.
- Alignment Measurement: A report card showing how well the chosen assertions match user preferences.
- Mixed-Initiative: Combines automated assistance with user control.

Evaluation:
- Offline Evaluation: Compared EvalGen's algorithm with SPADE (a fully automated assertion generation tool). EvalGen achieved better alignment with fewer assertions, thanks to human input in the criteria selection stage.
- Qualitative User Study: Nine industry practitioners used EvalGen to build evaluators for LLM pipelines.

As businesses scale AI-first workflows, evals grow even more critical. Combining humans and LLMs seems like an excellent fit.
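The selection and report-card step is easy to picture in code. Below is a minimal sketch, assuming code-based assertions, binary human grades, and a simple coverage / false-failure-rate alignment score; EvalGen's actual selection algorithm and thresholds differ in the details, and the names here (GradedExample, select_assertions, etc.) are illustrative, not EvalGen's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# An "assertion" here is any function that takes an LLM output (a string)
# and returns True if the output passes the criterion. In EvalGen these can
# also be LLM-prompt based; code-based checks keep the sketch self-contained.
Assertion = Callable[[str], bool]

@dataclass
class GradedExample:
    output: str        # LLM output shown to the user
    human_pass: bool   # user's thumbs-up / thumbs-down grade

def alignment_scores(assertion: Assertion, graded: List[GradedExample]) -> Dict[str, float]:
    """Score one candidate assertion against human grades.

    coverage:  fraction of human-failed outputs the assertion also fails
    ffr:       false failure rate, fraction of human-passed outputs it wrongly fails
    alignment: harmonic-mean-style combination of coverage and (1 - ffr)
    """
    failed = [g for g in graded if not g.human_pass]
    passed = [g for g in graded if g.human_pass]

    coverage = (
        sum(not assertion(g.output) for g in failed) / len(failed) if failed else 0.0
    )
    ffr = (
        sum(not assertion(g.output) for g in passed) / len(passed) if passed else 0.0
    )
    precision_like = 1.0 - ffr
    alignment = (
        2 * coverage * precision_like / (coverage + precision_like)
        if (coverage + precision_like) > 0
        else 0.0
    )
    return {"coverage": coverage, "ffr": ffr, "alignment": alignment}

def select_assertions(
    candidates: Dict[str, Assertion],
    graded: List[GradedExample],
    min_alignment: float = 0.6,
) -> Dict[str, Dict[str, float]]:
    """Keep only the candidate assertions whose alignment clears a threshold."""
    report_card = {name: alignment_scores(fn, graded) for name, fn in candidates.items()}
    return {
        name: scores
        for name, scores in report_card.items()
        if scores["alignment"] >= min_alignment
    }

# Example: two code-based candidate assertions for a "be concise" criterion.
candidates = {
    "under_100_words": lambda out: len(out.split()) < 100,
    "no_bullet_lists": lambda out: "- " not in out,
}
graded = [
    GradedExample("Short, direct answer.", human_pass=True),
    GradedExample("A very long rambling answer " * 30, human_pass=False),
]
print(select_assertions(candidates, graded))
```

Because the grading is interactive, the pool of graded examples keeps growing as the user grades more outputs, so a report card like this would be recomputed as feedback arrives rather than fixed up front.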