使用大型语言模(LLM)构建系统(七)：评估2

通过运行端到端系统来回答用户查询

这里我们通过一个工具包utils来执行一系列的用户关于电子产品的问题的回复，utils中的函数在的原理在之前的博客中都有说明，这里不再赘述。

customer_msg = f"""
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs or TV related products do you have?"""

products_by_category = utils.get_products_from_query(customer_msg)
category_and_product_list = utils.read_string_to_list(products_by_category)
product_info = utils.get_mentioned_product_info(category_and_product_list)
assistant_answer = utils.answer_user_msg(user_msg=customer_msg,
                                                   product_info=product_info)

这里我们把customer_msg翻译成中文，这样便于大家更好的理解其中的含义：

这里我们首先调用了utils.get_products_from_query方法来获取用户问题中所涉及的产品目录的清单(这个过程需要访问LLM)。由于LLM返回的产品目录清单是字符串型所以我们还需要调用utils.read_string_to_list方法将字符串转换成python的List格式，然后我们查询出所有产品目录清单中所包含的所有产品的具体信息(这个过程不需要访问LLM)，最后我们将用户的问题和具体的产品信息一起发送给LLM，并生产最终的回复。下面我们分别看一下上面那些变量中的具体内容：

最后我们查看LLM的最终回复：

由于客户的问题中涉及了多个电子产品,因此LLM的回复的信息量有点多，我们如何来评估LLM的回复是否符合要求呢？

根据提取的产品信息，使用评分标准评估LLM对用户的回答

这里我们需要制定一个评分标准来评估LLM的最终回复是否合格，评估的依据来自于客户的问题，具体的产品信息以及LLM的最终回复这三部分内容。也就是说我们需要把这3部分内容整合在一起然后让LLM再来评估自己之前的回复是否合格,下面我们定义一个eval_with_rubric函数来实现这些功能：

def eval_with_rubric(test_set, assistant_answer):

    cust_msg = test_set['customer_msg']
    context = test_set['context']
    completion = assistant_answer
    
    system_message = """\
    You are an assistant that evaluates how well the customer service agent \
    answers a user question by looking at the context that the customer service \
    agent is using to generate its response. 
    """

    user_message = f"""\
You are evaluating a submitted answer to a question based on the context \
that the agent uses to answer the question.
Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {cust_msg}
    ************
    [Context]: {context}
    ************
    [Submission]: {completion}
    ************
    [END DATA]

Compare the factual content of the submitted answer with the context. \
Ignore any differences in style, grammar, or punctuation.
Answer the following questions:
    - Is the Assistant response based only on the context provided? (Y or N)
    - Does the answer include information that is not provided in the context? (Y or N)
    - Is there any disagreement between the response and the context? (Y or N)
    - Count how many questions the user asked. (output a number)
    - For each question that the user asked, is there a corresponding answer to it?
      Question 1: (Y or N)
      Question 2: (Y or N)
      ...
      Question N: (Y or N)
    - Of the number of questions asked, how many of these questions were addressed by the answer? (output a number)
"""

    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]

    response = get_completion_from_messages(messages)
    return response

这里我们将system_message和user_message翻译成中文，这样便于大家更好的理解：

接下来我们来评估LLM先前的回复：

cust_prod_info = {
    'customer_msg': customer_msg,
    'context': product_info
}

evaluation_output = eval_with_rubric(cust_prod_info, assistant_answer)
print(evaluation_output)

这里我们定义了一个评估的标准，也就是让LLM来回答我们在user_message 中定义的6个问题，这6个问题也就是我们的评分标准。从上面的输出结果上看LLM正确回答了所有的6个问题。

根据“理想”/“专家”（人类生成的）答案评估 LLM 对用户的答案

要评价LLM的回答的效果，除了让LLM回答一下简单的问题(这些问题可能是一些简单的统计)以外，还需要评估它与人类生成的专家或者理想的答案之前的差异。下面我们定义一个理想/专家的答案：

test_set_ideal = {
    'customer_msg': """\
tell me about the smartx pro phone and the fotosnap camera, the dslr one.
Also, what TVs or TV related products do you have?""",
    'ideal_answer':"""\
Of course!  The SmartX ProPhone is a powerful \
smartphone with advanced camera features. \
For instance, it has a 12MP dual camera. \
Other features include 5G wireless and 128GB storage. \
It also has a 6.1-inch display.  The price is $899.99.

The FotoSnap DSLR Camera is great for \
capturing stunning photos and videos. \
Some features include 1080p video, \
3-inch LCD, a 24.2MP sensor, \
and interchangeable lenses. \
The price is 599.99.

For TVs and TV related products, we offer 3 TVs \


All TVs offer HDR and Smart TV.

The CineView 4K TV has vibrant colors and smart features. \
Some of these features include a 55-inch display, \
'4K resolution. It's priced at 599.

The CineView 8K TV is a stunning 8K TV. \
Some features include a 65-inch display and \
8K resolution.  It's priced at 2999.99

The CineView OLED TV lets you experience vibrant colors. \
Some features include a 55-inch display and 4K resolution. \
It's priced at 1499.99.

We also offer 2 home theater products, both which include bluetooth.\
The SoundMax Home Theater is a powerful home theater system for \
an immmersive audio experience.
Its features include 5.1 channel, 1000W output, and wireless subwoofer.
It's priced at 399.99.

The SoundMax Soundbar is a sleek and powerful soundbar.
It's features include 2.1 channel, 300W output, and wireless subwoofer.
It's priced at 199.99

Are there any questions additional you may have about these products \
that you mentioned here?
Or may do you have other questions I can help you with?
    """
}

用专家的答案评估LLM的答案

下面我们会使用 OpenAI evals 项目中的提示语来实现BLEU(bilingual evaluation understudy)评估

BLEU 分数：评估两段文本是否相似的另一种方法。

def eval_vs_ideal(test_set, assistant_answer):

    cust_msg = test_set['customer_msg']
    ideal = test_set['ideal_answer']
    completion = assistant_answer
    
    system_message = """\
    You are an assistant that evaluates how well the customer service agent \
    answers a user question by comparing the response to the ideal (expert) response
    Output a single letter and nothing else. 
    """

    user_message = f"""\
You are comparing a submitted answer to an expert answer on a given question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {cust_msg}
    ************
    [Expert]: {ideal}
    ************
    [Submission]: {completion}
    ************
    [END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
    The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
    (A) The submitted answer is a subset of the expert answer and is fully consistent with it.
    (B) The submitted answer is a superset of the expert answer and is fully consistent with it.
    (C) The submitted answer contains all the same details as the expert answer.
    (D) There is a disagreement between the submitted answer and the expert answer.
    (E) The answers differ, but these differences don't matter from the perspective of factuality.
  choice_strings: ABCDE
"""

    messages = [
        {'role': 'system', 'content': system_message},
        {'role': 'user', 'content': user_message}
    ]

    response = get_completion_from_messages(messages)
    return response

我们将上面的翻译成中文，这样便于大家更好的理解：

接下来我们将比较LLM的答案和专家的(理想的)答案，并最终给出一个分数(A,B,C,D,E):

eval_vs_ideal(test_set_ideal, assistant_answer)

从评估结果上看，LLM给自己的答案的成绩打了A，也就是说LLM之前的答案是专家答案的子集，并且与其完全一致，从专家的答案中我们可知，专家对每一个产品都做了非常详细的说明，而这些说明有些是没有包含在系统的产品信息中的，所以LLM的答案是专家答案的一个子集，这个没有问题。因为LLM的回复来基于系统的产品信息，并且LLM不存在歪曲产品信息内容的可能性，因此它与专家答案中的部分内容是完全一致的。下面我们再尝试一个不正确的LLM回复，让它再和专家答案比较一下：

assistant_answer_2 = "life is like a box of chocolates"
eval_vs_ideal(test_set_ideal, assistant_answer_2)

这里我们给出了一个和产品信息无关的LLM回复：“life is like a box of chocolates”，此时评估结果给出的得分是 D，就是说提交的答案与专家的答案存在分歧。

总结

今天我们学习了如何使用两种评估LLM回复的方法，第一种是设计一下简单问题让LLM来回答，第二种是使用一个人类专家的答案来评估LLM的答案与专家答案存在的差异并给出一个评分(A,B,C,D,E)。

-派神-

关注

1
点赞
踩
1

收藏

觉得还不错? 一键收藏
打赏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫