ChatGPT能解决指代消解吗？_让gpt处理指代消解问题-CSDN博客

本文探讨了ChatGPT在WinogradSchemaChallenge中的性能，这是一种测试AI模型常识推理能力的难题。实验结果显示，尽管未经过微调，ChatGPT的准确率约为73%，略优于随机，但仍远低于人类水平，表明在处理这类需要世界知识的任务时，模型仍有提升空间。

摘要由CSDN通过智能技术生成

挑战介绍

介绍下指代消解问题，也称指代消歧，英文为Winograd Schema Challenge (WSC)，是为了测试AI模型的常识推理而引入的，它一般是在一个句子中找到代词指代的对象，推理过程中需要用到世界知识和常识。

下面是该问题是例子：
case1:

市政府拒绝给示威者颁发游行许可证，因为[担心/鼓吹]暴力事件。 谁[担心/鼓吹]暴力事件?
答案: 市政府/示威者

case2：

奖杯无法放进到棕色的箱子里，因为它太[小/大]了。 什么东西太[小/大]了?
答案: 箱子/奖杯

这是一个由计算语言学先驱特里·维诺格拉德(Terry Winograd)在20世纪70年代提出的老问题！这是一个很好的问题，因为它涵盖了人类语言处理中真正困难的许多方面，需要一定程度的世界知识来解决。事实上，到目前为止，WSC273数据集(Kocijan, et al. 2023)的最佳结果仅在90%左右（Kocijan等人，2023年），而人类通常得分约为100%。

最成功的方法使用经过精细调整的大型预训练语言模型，这些模型已在特别设计的训练集上进行了微调。这些训练/测试集通常很小，因为如何创建或捕捉好的例子并不明显。正如一位研究人员指出的那样（Kocijan等人，2020年），有时解决代词所需的知识可能比其他情况更抽象或复杂。

到目前为止，ChatGPT惊人的生成能力大家已经非常熟悉。我们将在本文中回答的问题是，没有任何微调，ChatGPT是否能够比文献中描述的一些成功方法得分更高。我们将进行实验。

实验设置

首先，要运行这篇帖子中的（Python）代码，您需要前往ChatGPT网站获取API密钥以使用该模型。尽管在复制本帖中的实验时不需要付费版本，但您可能需要分批次示例以避免达到每分钟60次的限制

他们提供了一些语言模型，gpt-4是当前最好的模型，所以我们将使用它。我们将使用Hugging Face上的数据集，因为它已经为我们打包好了。以下是获取OpenAI模型和Hugging Face数据集访问权限的代码：

from datasets import load_dataset
import openai
import os

openai.api_key = os.getenv("OPENAI_API_KEY")
wsc273_dataset = load_dataset('winograd_wsc', name='wsc273',  split='test')

这是数据集中的273个示例之一，与上面的示例类似：

{‘text’: “The trophy doesn’t fit into the brown suitcase because it is too large.”, ‘pronoun’: ‘it’, ‘pronoun_loc’: 55, ‘quote’: ‘it is too large’, ‘quote_loc’: 55, ‘options’: [‘the trophy’, ‘the suitcase’], ‘label’: 0, ‘source’: ‘Hector Levesque’}

说明：当前数据集只有英文的，大多数国内模型均支持中英文，评估推理能力用此英文数据集也可以。

接下来，我们需要一种方法将测试集中的数据做下转换，变为ChatGPT模型的输入prompt。转换代码如下。

def query_format_helper(pronoun: str, answers: List[str]) -> str:
    return f'Q: In the previous statement, does "{pronoun}" refer to {answers[0]} or {answers[1]}? A:'


def construct_query_from_schema(
        text: str, pronoun: str, answers: List[str]
) -> str:
    return f"S:{text} {query_format_helper(pronoun, answers)}"

c = 0
for example in wsc273_dataset:
    query = construct_query_from_schema(example['text'], example['pronoun'], example['options'])
    print(query)
    break

基于API用模型预测：

def get_openai_answer(query_prompt: str, add_leading: bool = True) -> str:
    init_prompt_starter = "The following are pairs of Winograd Schema in the form of a statement S, a question Q, and an answer A:"
    init_prompt1 = "S: The cat went through the door, but it's tail got stuck. Q: In the previous statement, what does 'it' refer to? A: The cat."
    init_prompt2 = "S: The cat tried to go through the door, but it was too small. Q: In the previous statement, what does 'it' refer to? A: The door."
    init_prompt3 = "S: Fedex made more profit than UPS last year, but that was mostly due to the success of the new delivery system they implemented. Q: In the previous statement, what does 'they' refer to? A: Fedex."
    init_prompt4 = "S: Sam tried to buy Xerxes lunch, but he wouldn't allow it. Q: In the previous statement, who does 'he' refer to? A: Xerxes."

    # add leading prompts to cue model...
    if add_leading:
        query_prompt = f"{init_prompt_starter} {init_prompt1}, {init_prompt2}, {init_prompt3}, {init_prompt4}, {query_prompt}"

    response = openai.ChatCompletion.create(
        model="gpt-4",
        prompt=query_prompt,
        temperature=0,
        max_tokens=200,
        top_p=1,
        frequency_penalty=0.0,
        presence_penalty=0.0,
        stop=["\n"]
    )

    return response['choices'][0]['text']

完整代码：https://github.com/shibing624/tools/blob/main/gpt_demo/wsc_demo.ipynb

评测结果

好的，现在我们只需要运行一个循环，将示例传递给模型，如果答案在输出中（均转换为小写），则将输出评分为正确，否则评分为不正确。我们将使用准确率作为我们的评估指标，以保持与文献的一致性，计算方法为正确数量 / 总示例数量。我没有包括程序的这部分，因为一旦你设置好其他部分，这部分相对比较简单。是时候运行代码了！

结果显示ChatGPT在这项任务上的准确率约为73%，优于随机机会，也优于过去10年中的大多数尝试，但并不接近最先进水平。也许你可以通过一些微调或巧妙的命令链提示做得更好，但在第一种情况下数据是有限的，在第二种情况下提取相关信息并不直接（即文本本身不包含空间和物体行为的概念）。另一个好主意是使用在NLI数据上预训练/微调的模型，而不仅仅是大量不相关的英文文本，但它必须能够将世界知识的有用表示学习到其潜在空间中。

这是模型预测错误的一个例子：

The lawyer asked the witness a question, but he was reluctant to repeat it.

模型答复：

I’m sure you could imagine a scenario where the “he” was referring to either the lawyer or the witness.

实际应该是：the lawyer

为了进行良好的实践，这些实验运行了多次，准确度只有微小的差异（<0.1）。此外，还尝试了对提示进行一些增强，如在传递提示的代码中所示，我们通过一些示例稍微引导了模型，但准确度并没有真正改变。

最近发布了一篇论文ChatGPT: Jack of all trades, master of none，显示出ChatGPT在许多方面表现良好，但没有在任何方面表现出色，杂而不精。这样看来在指代消解挑战上，ChatGPT也表现出这种杂而不精的情况！