xFinder: An Efficient LLM Answer Extractor for All Kinds of LLM Evaluation Frameworks

Paper title:

xFinder: Robust and Pinpoint Answer Extraction for Large Language Models

Affiliations:

Institute for Advanced Algorithms Research, Shanghai (IAAR); Renmin University of China

Paper:

https://arxiv.org/abs/2405.11874

Code:

https://github.com/IAAR-Shanghai/xFinder

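Reliable LLM Evaluation

The rapid growth of large language models (LLMs) has created demand for comprehensive, efficient and accurate benchmarking. Unified evaluation frameworks have emerged in response, such as LM Eval Harness (the backend of the HuggingFace Open LLM Leaderboard), OpenCompass from China, and UltraEval from Tsinghua University.

But how reliable are these frameworks? When we evaluate models, what we most want to know is whether model A is better than model B. Yet with leaderboard gaming now widespread, the reliability of LLM evaluation clearly falls short. This work therefore dissects reliability across the LLM evaluation pipeline, using a complete, typical workflow as the running example, as shown in the figure below.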

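1. Data Item Transformation: Questions from a wide range of evaluation datasets (for example MMLU, GSM8K and TruthfulQA) are converted into a similar question-answer format, such as single-choice questions with lettered options. The unreliability here is that forcing every question into lettered options is itself questionable: many models are biased toward particular option letters, and their choices are less stable with respect to the letters than with respect to the option content.

2. Question Prompting & LLM Answering: Each question is sent to the model under some specific prompting scheme (for example few-shot or CoT), and the model answers it. This step has even more sources of instability.

A given LLM may simply be better at a particular prompt format, so small prompt changes can shift the final score substantially. Whether the test set of a task leaked into the model's training data also matters a great deal, which is especially visible in today's evaluation work. Temperature and decoding strategy significantly affect what the model generates as well; the Self-Consistency paper already showed that a model does not produce stable outputs, so decoding contributes many sources of instability.

3. Key Answer Extraction & Matching: Once the LLM's responses are collected, the option it actually chose usually still has to be extracted. The model may output something like “I'm willing to help. My answer to the question is xxx. Explanation: xxxxxx. Therefore, I choose option B.” An extraction step is therefore needed before the result can be matched against the correct answer. All of these frameworks use regular expressions for this step, yet the authors found many convincing cases where regex extraction fails, as shown in the figure below.

4. Aggregation: Once each data item has been judged, the results need to be aggregated. Existing frameworks usually report only accuracy, with standard deviation at most as a reference. Such simple statistical aggregation falls short. For example, when the regex fails to extract anything, it is inappropriate to simply ignore the item or to treat the model as having answered wrongly. Richer metrics are needed to make the evaluation results more valid.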
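Steps 3 and 4 can be made concrete with a small sketch. The pattern, function names and sample responses below are illustrative assumptions, not the rules of any particular framework. The sketch shows a naive regex extractor missing a valid answer, and an aggregation that reports the unextractable share separately instead of silently counting it as wrong.

```python
import re
from collections import Counter

# Hypothetical pattern of the kind a framework might use; not taken from any real framework.
ANSWER_PATTERN = re.compile(r"answer is \(?([A-D])\)?", re.IGNORECASE)

def regex_extract(response):
    """Return the option letter matched by the pattern, or None if nothing matches."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1).upper() if match else None

def aggregate(responses, gold_labels):
    """Count correct, wrong and unextractable items separately."""
    counts = Counter()
    for response, gold in zip(responses, gold_labels):
        predicted = regex_extract(response)
        if predicted is None:
            counts["unextractable"] += 1
        elif predicted == gold:
            counts["correct"] += 1
        else:
            counts["wrong"] += 1
    total = len(gold_labels)
    return {
        "accuracy": counts["correct"] / total,
        "unextractable_rate": counts["unextractable"] / total,
        "counts": dict(counts),
    }

# Made-up responses for illustration only.
responses = [
    "The answer is (B).",
    "I'm willing to help. Explanation: ... Therefore, I choose option B.",
    "The answer is C because ...",
]
gold_labels = ["B", "B", "A"]
report = aggregate(responses, gold_labels)
```

The second response clearly picks option B, but the pattern never sees the phrase it expects, so the item ends up as unextractable rather than correct. Reporting that share next to accuracy makes such failures visible.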

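Key Answer Extraction

To address the problems above, the authors focus on the most practical and most critical one, the Key Answer Extraction & Matching step, and build an efficient key answer extractor, xFinder, through fine-tuning. The core of building xFinder is constructing an effective dataset for the extraction task. The input X of this dataset consists of questions from various evaluation tasks together with the corresponding LLM responses, and the label Y is the extracted result. Some examples follow: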

{
  'key_answer_type': 'alphabet option',
  'question': 'A man is seen playing guitar on a stage with others playing instruments behind him. The man grabs a guitar from the audience and begins playing both one after the other ...',
  'llm_output': 'Option A is the correct choice as it describes ...',
  'standard_answer_range': "[['A', 'strums the guitar in the end, continues playing the guitar with the crowd following him as well as lining up next to him.'], ['B', 'continues playing the instruments and ends by waving to the crowd and walking off stage.'], ['C', 'then turns to the audience and gives a stuffed toy to the audience and continues playing.'], ['D', 'finally stops playing and moves his hands for the crowd to see.']]",
  'gold_label': 'A',
  'xFinder_output': 'A',
},
{
  'key_answer_type': 'short text',
  'question': 'If you really wanted a grape, where would you go to get it? Answer Choices: winery / fruit stand / field / kitchen / food',
  'llm_output': 'The answer is winery / fruit stand / field / kitchen / food ...',
  'standard_answer_range': "['winery', 'fruit stand', 'field', 'kitchen', 'food']",
  'gold_label': '[No valid answer]',
  'xFinder_output': '[No valid answer]',
},
{
  'key_answer_type': 'categorical label',
  'question': 'How tall is the Sears Building ?',
  'llm_output': 'The Sears Building is a specific structure, so the answer would be a Location ...',
  'standard_answer_range': "['Abbreviation', 'Entity', 'Description', 'Person', 'Location', 'Number']",
  'gold_label': 'Location',
  'xFinder_output': 'Location',
},
{
  'key_answer_type': 'math',
  'question': ' Mike made 69 dollars mowing lawns over the summer. If he spent 24 dollars buying new mower blades, how many 5 dollar games could he buy with the money he had left? ',
  'llm_output': "To find out how many 5 dollar ... Let's calculate that:\n\n$45 / $5 = 9\n\nSo, Mike could buy 9 5 dollar games with the money he had left.",
  'standard_answer_range': 'a(n) number / set / vector / matrix / interval / expression / function / equation / inequality',
  'gold_label': '9',
  'xFinder_output': '9',
}
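To illustrate how such a record splits into an input X and a label Y, here is a minimal sketch. The function name `to_training_pair` and the prompt template are assumptions made for illustration; the prompt actually used to train xFinder is not given in this post.

```python
def to_training_pair(item):
    """Build a (prompt, target) pair from one record; the template is illustrative only."""
    prompt = (
        f"Question: {item['question'].strip()}\n"
        f"LLM output: {item['llm_output']}\n"
        f"Answer type: {item['key_answer_type']}\n"
        f"Answer range: {item['standard_answer_range']}\n"
        "Key answer:"
    )
    # The label Y is the annotated key answer, i.e. what the LLM actually answered.
    return prompt, item["gold_label"]

# Fields copied from the "categorical label" example record.
item = {
    "key_answer_type": "categorical label",
    "question": "How tall is the Sears Building ?",
    "llm_output": "The Sears Building is a specific structure, so the answer would be a Location ...",
    "standard_answer_range": "['Abbreviation', 'Entity', 'Description', 'Person', 'Location', 'Number']",
    "gold_label": "Location",
}
prompt, target = to_training_pair(item)
```

Note that in this record the target is `Location` because that is what the response says, even though it is not the right category for the question. The extractor is asked to recover the model's answer, not to judge it.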

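The authors also give a precise mathematical definition of the key answer extraction task (shown below), which we do not go into here.

Building xFinder

The construction of xFinder can be divided into three stages.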

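Stage 1: LLM Response Generation. The authors run 10+ LLMs on 10+ benchmarks to collect a large volume of LLM outputs. This helps xFinder generalize as far as possible.

Stage 2: Auto Labelling and Human Recheck. To label the key answer in each response, the authors combine automatic and human annotation. They first query GPT-4 with a Self-Consistency strategy, asking what the key answer in the LLM response is. If the answers are inconsistent, the item is handed to human annotators. The result is the final labelled Key Answer Finder (KAF) dataset.

Stage 3: Training xFinder. A range of base LLMs are fine-tuned with supervision on the KAF dataset using QLoRA. The base models include the LLaMA, Qwen and Gemma series, among others.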
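The labelling loop in Stage 2 can be sketched roughly as follows. This is a sketch under assumptions: `annotate` stands in for a call to the annotator model, the sample count of 5 is arbitrary, and requiring a unanimous vote is one possible reading of "inconsistent". None of these details are specified in this post.

```python
import itertools
from collections import Counter

def auto_label(annotate, question, llm_output, n_samples=5):
    """Query the annotator n_samples times; accept the label only if all samples agree."""
    labels = [annotate(question, llm_output) for _ in range(n_samples)]
    label, votes = Counter(labels).most_common(1)[0]
    if votes == n_samples:
        return {"label": label, "needs_human": False}
    # Disagreement: route the item to a human annotator.
    return {"label": None, "needs_human": True, "candidates": sorted(set(labels))}

# Stub annotators standing in for a real model call (hypothetical).
def consistent_annotator(question, llm_output):
    return "B"

_flip = itertools.cycle(["A", "B"])

def inconsistent_annotator(question, llm_output):
    return next(_flip)

accepted = auto_label(consistent_annotator, "q", "...")
flagged = auto_label(inconsistent_annotator, "q", "...")
```

Here `accepted` carries the label directly, while `flagged` is marked for human review together with the candidate labels the annotator produced.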

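Experimental Analysis

The authors run three experiments. The first measures model performance on the KAF test set (the results are good, as expected, and are not repeated here). The second builds a new dataset using LLMs and evaluation tasks that differ from those used to generate KAF, and uses it to test xFinder's generalization, since most of these new items are unseen. The third places xFinder inside evaluation frameworks and, rather than looking at extraction accuracy, compares the final evaluation scores with those produced by other evaluation frameworks.

For the second experiment, the results are shown below. xFinder's extraction accuracy is considerably higher than that of GPT-4 and of the regex-based baselines, which demonstrates good generalization.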
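For readers unfamiliar with the metric, extraction accuracy can be sketched as the share of items where an extractor's output equals the annotated key answer. The helper below is an assumption about how one might compute it, not the authors' evaluation code. The gold and xFinder labels are copied from the four example records earlier in this post, while the `regex_output` column is made up purely to show the comparison and is not a measurement.

```python
def extraction_accuracy(records, output_key):
    """Share of records whose extractor output equals the annotated key answer."""
    hits = sum(1 for r in records if r[output_key] == r["gold_label"])
    return hits / len(records)

# gold_label and xFinder_output come from the example records in this post;
# regex_output is a hypothetical baseline column added for illustration only.
records = [
    {"gold_label": "A", "xFinder_output": "A", "regex_output": "A"},
    {"gold_label": "[No valid answer]", "xFinder_output": "[No valid answer]", "regex_output": "winery"},
    {"gold_label": "Location", "xFinder_output": "Location", "regex_output": "Location"},
    {"gold_label": "9", "xFinder_output": "9", "regex_output": "5"},
]

xfinder_acc = extraction_accuracy(records, "xFinder_output")
baseline_acc = extraction_accuracy(records, "regex_output")
```

This metric scores the extractor rather than the evaluated LLM: in the examples above, `gold_label` appears to record what the response said, whether or not that answer is correct for the question.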