#LLM Basics|Prompt#2.7_Check_Outputs

Latest recommended article published 2024-06-09 13:15:25

向日葵花籽儿

Category: LLM入门教程笔记 (LLM introductory tutorial notes). Tags: prompt python LLM AIGC

Article link: https://blog.csdn.net/weixin_45312236/article/details/136452411

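This section covers:

·How to evaluate the output the system generates.
·Why output should be checked for quality, relevance, and safety before it is shown to the user, so that the response is accurate and appropriate.
·How to use the Moderation API to evaluate output.
·How to use an additional prompt so that the model assesses output quality before the output is displayed.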

I. Checking for harmful content

Harmful content is checked mainly with the Moderation API that OpenAI provides.

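import openai
# `tool` is assumed to be the tutorial's local helper module, not part of the openai package.
from tool import get_completion_from_messages

# The reply that is about to be shown to the customer.
final_response_to_customer = """
The SmartX ProPhone has a 6.1-inch display, 128GB storage, \
a 12MP dual camera, and 5G. The FotoSnap DSLR Camera \
has a 24.2MP sensor, 1080p video, a 3-inch LCD, and \
interchangeable lenses. We have a variety of TVs, including the CineView 4K TV \
with a 55-inch display, 4K resolution, HDR, and smart TV features. \
We also have the SoundMax Home Theater system with 5.1 channels, \
1000W output, a wireless subwoofer, and Bluetooth. Do you have any specific \
questions about these products or any other products we offer?
"""
# Moderation is OpenAI's content-moderation endpoint; it scores text for potential risk categories.
# openai.Moderation.create is the call style of openai Python SDK versions before 1.0.
response = openai.Moderation.create(
    input=final_response_to_customer
)
# The endpoint returns one result per input; only one input was sent.
moderation_output = response["results"][0]
print(moderation_output)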

{
  "categories": {
    "harassment": false,
    "harassment/threatening": false,
    "hate": false,
    "hate/threatening": false,
    "self-harm": false,
    "self-harm/instructions": false,
    "self-harm/intent": false,
    "sexual": false,
    "sexual/minors": false,
    "violence": false,
    "violence/graphic": false
  },
  "category_scores": {
    "harassment": 4.2861907e-07,
    "harassment/threatening": 5.9538485e-09,
    "hate": 2.079682e-07,
    "hate/threatening": 5.6982725e-09,
    "self-harm": 2.3966843e-08,
    "self-harm/instructions": 1.5763412e-08,
    "self-harm/intent": 5.042827e-09,
    "sexual": 2.6989035e-06,
    "sexual/minors": 1.1349888e-06,
    "violence": 1.2788286e-06,
    "violence/graphic": 2.6259923e-07
  },
  "flagged": false
}
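
A result like the one above can also be read programmatically. The following is a minimal sketch, not part of the original tutorial: `summarize_moderation` is a hypothetical helper name, and it assumes the result is available as a plain dict shaped like the printed output (abbreviated here to three categories).

```python
def summarize_moderation(result):
    """Return (flagged, flagged_categories, top_category, top_score)."""
    # Categories whose boolean flag is set.
    flagged_categories = sorted(
        name for name, hit in result["categories"].items() if hit
    )
    # The single category with the highest score, flagged or not.
    top_category, top_score = max(
        result["category_scores"].items(), key=lambda item: item[1]
    )
    return result["flagged"], flagged_categories, top_category, top_score


# Abbreviated copy of the output above (assumption: plain-dict shape).
sample = {
    "categories": {"harassment": False, "sexual": False, "violence": False},
    "category_scores": {
        "harassment": 4.2861907e-07,
        "sexual": 2.6989035e-06,
        "violence": 1.2788286e-06,
    },
    "flagged": False,
}

print(summarize_moderation(sample))
```

Looking at the highest-scoring category even when nothing is flagged makes it easier to see how close an output came to a limit.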

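Checking output quality
Checking output quality is essential to a chatbot that behaves well. Some ways to do it:
1. Flag outputs

If an output is not flagged in any category and scores low in every category, the moderation result indicates that the content is acceptable.
For audiences that are sensitive to certain content, set lower thresholds so that borderline outputs are also flagged.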
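
Stricter thresholds can be applied to the `category_scores` values directly instead of relying on the built-in `flagged` field. The sketch below is one way to do that under stated assumptions: `categories_over_threshold` is a hypothetical helper, and the threshold values and example scores are invented for illustration, not recommendations.

```python
def categories_over_threshold(category_scores, threshold=0.01, overrides=None):
    """Return the categories whose score meets or exceeds its threshold.

    `threshold` is the default limit; `overrides` maps a category name to a
    stricter (or looser) limit for that category only.
    """
    overrides = overrides or {}
    return sorted(
        name
        for name, score in category_scores.items()
        if score >= overrides.get(name, threshold)
    )


# Invented scores, for illustration only.
scores = {"harassment": 0.002, "violence": 0.03, "sexual": 0.0005}

# Default threshold for every category.
print(categories_over_threshold(scores))
# A much stricter limit for one category, e.g. for a younger audience.
print(categories_over_threshold(scores, overrides={"sexual": 0.0001}))
```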

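2. Take appropriate action

If some content is flagged as problematic, take appropriate action, such as returning a fallback answer or generating a new response.
As models continue to improve, harmful outputs should become less likely.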
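
The "regenerate or fall back" step can be sketched as a small retry loop. This is an illustration under assumptions, not the tutorial's code: `respond_safely` is a hypothetical helper, and `generate` and `is_flagged` are stand-in callables for a model call and a moderation check.

```python
def respond_safely(generate, is_flagged, max_attempts=3,
                   fallback="Sorry, I can't help with that request."):
    """Call `generate()` until a draft passes the check, else return `fallback`."""
    for _ in range(max_attempts):
        draft = generate()
        if not is_flagged(draft):
            return draft
    # Every attempt was flagged, so return the canned answer instead.
    return fallback


# Stand-ins for a model call and a moderation check (both hypothetical).
drafts = iter(["BAD draft", "Here is the product information you asked for."])
reply = respond_safely(lambda: next(drafts), lambda text: "BAD" in text)
print(reply)
```

Capping the number of attempts keeps the added latency and token cost bounded when the model keeps producing flagged drafts.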

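3. Ask the model

Feed the generated output back to the model as part of a new input and ask it to evaluate the output's quality.
This can be done in several ways; section II shows one, in which the model grades a customer service reply against the product information.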

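Summary

It is important that the chatbot's output meets expectations.
Output quality can be checked by flagging outputs, taking appropriate action on flagged content, and asking the model to evaluate its own output.
As models continue to improve, output quality should gradually rise.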

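II. Checking consistency with the product information

In the following example, the LLM acts as an assistant that checks whether a reply sufficiently answers the customer's question and verifies that the facts the reply cites are correct.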

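# System message: tells the model to act as an evaluator of the agent's reply
system_message = """
You are an assistant that evaluates whether a customer service agent's response \
sufficiently answers the customer's question, and that verifies that all the facts \
the agent cites from the product information are correct.
The product information, the user message, and the customer service agent message \
are delimited by three backticks, i.e. ```.
Respond with a single character, Y or N, with no punctuation:
Y - if the output sufficiently answers the question and the response uses the product information correctly
N - otherwise

Output a single letter only.
"""

# The customer's question
customer_message = """
Tell me about the smartx pro phone \
and the fotosnap camera (the DSLR one). \
Also tell me about your TVs.
"""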
product_information = """{ "name": "SmartX ProPhone", "category": "Smartphones and Accessories", "brand": "SmartX", "model_number": "SX-PP10", "warranty": "1 year", "rating": 4.6, "features": [ "6.1-inch display", "128GB storage", "12MP dual camera", "5G" ], "description": "A powerful smartphone with advanced camera features.", "price": 899.99 }
{ "name": "FotoSnap DSLR Camera", "category": "Cameras and Camcorders", "brand": "FotoSnap", "model_number": "FS-DSLR200", "warranty": "1 year", "rating": 4.7, "features": [ "24.2MP sensor", "1080p video", "3-inch LCD", "Interchangeable lenses" ], "description": "Capture stunning photos and videos with this versatile DSLR camera.", "price": 599.99 }
{ "name": "CineView 4K TV", "category": "Televisions and Home Theater Systems", "brand": "CineView", "model_number": "CV-4K55", "warranty": "2 years", "rating": 4.8, "features": [ "55-inch display", "4K resolution", "HDR", "Smart TV" ], "description": "A stunning 4K TV with vibrant colors and smart features.", "price": 599.99 }
{ "name": "SoundMax Home Theater", "category": "Televisions and Home Theater Systems", "brand": "SoundMax", "model_number": "SM-HT100", "warranty": "1 year", "rating": 4.4, "features": [ "5.1 channel", "1000W output", "Wireless subwoofer", "Bluetooth" ], "description": "A powerful home theater system for an immersive audio experience.", "price": 399.99 }
{ "name": "CineView 8K TV", "category": "Televisions and Home Theater Systems", "brand": "CineView", "model_number": "CV-8K65", "warranty": "2 years", "rating": 4.9, "features": [ "65-inch display", "8K resolution", "HDR", "Smart TV" ], "description": "Experience the future of television with this stunning 8K TV.", "price": 2999.99 }
{ "name": "SoundMax Soundbar", "category": "Televisions and Home Theater Systems", "brand": "SoundMax", "model_number": "SM-SB50", "warranty": "1 year", "rating": 4.3, "features": [ "2.1 channel", "300W output", "Wireless subwoofer", "Bluetooth" ], "description": "Upgrade your TV's audio with this sleek and powerful soundbar.", "price": 199.99 }
{ "name": "CineView OLED TV", "category": "Televisions and Home Theater Systems", "brand": "CineView", "model_number": "CV-OLED55", "warranty": "2 years", "rating": 4.7, "features": [ "55-inch display", "4K resolution", "HDR", "Smart TV" ], "description": "Experience true blacks and vibrant colors with this OLED TV.", "price": 1499.99 }"""

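q_a_pair = f"""
Customer message: ```{customer_message}```
Product information: ```{product_information}```
Agent response: ```{final_response_to_customer}```

Does the response use the retrieved information correctly?
Does the response sufficiently answer the question?

Output Y or N
"""
# Ask the model to judge the agent's response
messages = [
    {'role': 'system', 'content': system_message},
    {'role': 'user', 'content': q_a_pair}
]

# max_tokens=1 limits the reply to a single token, i.e. the Y/N verdict
response = get_completion_from_messages(messages, max_tokens=1)
print(response)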

Y
The previous example supplied a positive case, and the LLM judged it correctly. The next example supplies a negative case, which the LLM also judges correctly.

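# A reply that has nothing to do with the customer's question
another_response = "Life is like a box of chocolates"
q_a_pair = f"""
Customer message: ```{customer_message}```
Product information: ```{product_information}```
Agent response: ```{another_response}```

Does the response use the retrieved information correctly?
Does the response sufficiently answer the question?

Output Y or N
"""
messages = [
    {'role': 'system', 'content': system_message},
    {'role': 'user', 'content': q_a_pair}
]

response = get_completion_from_messages(messages)
print(response)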

N
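
Before acting on the verdict, it helps to normalize it, because a model reply is not guaranteed to be exactly one bare letter. The following is a minimal sketch: `parse_verdict` is a hypothetical helper, and treating "yes"/"no" as equivalent to Y/N is an assumption of this example rather than something the prompt above asks for.

```python
import string


def parse_verdict(raw):
    """Map the evaluator's reply to True (Y), False (N), or None (unclear)."""
    tokens = raw.strip().split()
    if not tokens:
        return None
    # Look only at the first word, ignoring case and surrounding punctuation.
    first = tokens[0].strip(string.punctuation).upper()
    if first in {"Y", "YES"}:
        return True
    if first in {"N", "NO"}:
        return False
    return None


for reply in ["Y", " n.\n", "Maybe"]:
    print(repr(reply), "->", parse_verdict(reply))
```

Returning `None` for anything unrecognized lets the caller treat an unclear verdict as a failure instead of silently showing the reply.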
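Checks like these, whether through the Moderation API or a model-based evaluation, give you feedback on the quality of a generated output. You can use that feedback to decide whether to show the output to the user or to generate a new response. You can also generate several candidate responses for each user query and present the best one. There are many approaches to try.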
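
Selecting the best of several candidate responses can be sketched as follows. This is an illustration only: `pick_best` is a hypothetical helper, and the keyword-counting `toy_score` merely stands in for a real evaluator such as the model-based check shown earlier.

```python
def pick_best(candidates, score):
    """Return the highest-scoring candidate, or None if none can be scored."""
    scored = [(score(text), text) for text in candidates]
    # Drop candidates the scorer could not evaluate.
    scored = [(value, text) for value, text in scored if value is not None]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


# Toy scorer (assumption): count how many requested products are mentioned.
keywords = ("SmartX ProPhone", "FotoSnap", "CineView")


def toy_score(text):
    return sum(keyword in text for keyword in keywords)


candidates = [
    "Life is like a box of chocolates",
    "The SmartX ProPhone has a 6.1-inch display and 5G.",
    "SmartX ProPhone: 6.1-inch display. CineView 4K TV: 55-inch display.",
]
print(pick_best(candidates, toy_score))
```

Each extra candidate costs another generation and another evaluation, so the number of candidates directly multiplies latency and token usage.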
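In most cases, however, checking the output is not necessary, especially with a more advanced model such as GPT-4. In real production environments, few teams actually do this. The approach also increases the system's latency and cost, because you have to wait for an additional API call and pay for additional tokens. If your application or product must keep its error rate extremely low, on the order of 0.0000001%, this strategy may be worth trying. Overall, though, it is not recommended for practical use.
The following chapters combine what has been covered about evaluating input, processing output, and checking generated content to build an end-to-end system.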