Controlling LLM Hallucination with Guardrails

技术与健康

Published 2024-08-19 08:07:48

Article link: https://blog.csdn.net/Practicer2015/article/details/141069563


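Guardrails are a set of rules and checks designed to keep an LLM's output accurate, appropriate, and in line with user expectations, and thereby to limit hallucinations.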

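This article covers two kinds of guardrails:

Input guardrails handle non-compliant requests before they reach the LLM.
Output guardrails validate the model's response before it is returned to the end user.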
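
As a rough illustration of where the two checks sit, the sketch below wires an input check, a generation step, and an output check into one function. Everything in it is a hypothetical stand-in: `run_with_guardrails`, `on_topic`, `fake_llm`, and `no_breed_names` are not part of any library, and the keyword checks only play the role that LLM-backed checks play later in this article.

```python
from typing import Callable


def run_with_guardrails(
    user_request: str,
    input_check: Callable[[str], bool],
    generate: Callable[[str], str],
    output_check: Callable[[str], bool],
    refusal: str = "Sorry, I can't help with that.",
) -> str:
    # Input guardrail: reject the request before it reaches the model
    if not input_check(user_request):
        return refusal
    draft = generate(user_request)
    # Output guardrail: validate the draft before the user sees it
    if not output_check(draft):
        return refusal
    return draft


# Hypothetical stand-ins for the real, LLM-backed pieces
def on_topic(text: str) -> bool:
    return any(word in text.lower() for word in ("cat", "dog"))


def fake_llm(text: str) -> str:
    return f"Echo: {text}"


def no_breed_names(text: str) -> bool:
    return "labrador" not in text.lower()
```

A request fails fast at the input check, and a response that slips through can still be stopped at the output check.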

Input guardrails

#step1
import openai

GPT_MODEL = 'gpt-4o-mini'
#step2
system_prompt = "You are a helpful assistant."

bad_request = "I want to talk about horses"
good_request = "What are the best breeds of dog for people that like cats?"

# step3
import asyncio


async def get_chat_response(user_request):
    print("Getting LLM response")
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0.5
    )
    print("Got LLM response")

    return response.choices[0].message.content


async def topical_guardrail(user_request):
    print("Checking topical guardrail")
    messages = [
        {
            "role": "system",
            "content": "Your role is to assess whether the user question is allowed or not. The allowed topics are cats and dogs. If the topic is allowed, say 'allowed' otherwise say 'not_allowed'",
        },
        {"role": "user", "content": user_request},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0
    )

    print("Got guardrail response")
    return response.choices[0].message.content


async def execute_chat_with_guardrail(user_request):
    topical_guardrail_task = asyncio.create_task(topical_guardrail(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        done, _ = await asyncio.wait(
            [topical_guardrail_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )
        if topical_guardrail_task in done:
            guardrail_response = topical_guardrail_task.result()
            if guardrail_response == "not_allowed":
                chat_task.cancel()
                print("Topical guardrail triggered")
                return "I can only talk about cats and dogs, the best animals that ever lived."
            elif chat_task in done:
                chat_response = chat_task.result()
                return chat_response
        else:
            await asyncio.sleep(0.1)  # sleep for a bit before checking the tasks again
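# step4
# Top-level await works in a Jupyter notebook; in a plain script, wrap the call in asyncio.run(...)
# Call the main function with the good request - this should go through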
response = await execute_chat_with_guardrail(good_request)
print(response)

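If you like cats and are thinking about getting a dog, several breeds are known for being compatible with feline friends. Here are some of the best dog breeds for living with cats:

1. **Golden Retriever**: Friendly and tolerant, Golden Retrievers tend to get along well with other animals, including cats.

2. **Labrador Retriever**: Similar to Golden Retrievers, Labradors are sociable and make good companions for cats.

3. **Cavalier King Charles Spaniel**: This breed is gentle and affectionate, and often forms strong bonds with other pets.

4. **Basset Hound**: Basset Hounds are laid-back and generally calm, which can help them coexist peacefully with cats.

5. **Beagle**: Beagles are friendly and sociable, and they often enjoy the company of other animals, including cats.

6. **Pug**: Pugs are known for their playful and friendly nature, which makes them good companions for cats.

7. **Shih Tzu**: Shih Tzus are usually friendly and adaptable, and typically get along well with other pets.

8. **Collie**: Collies are known for their gentle and protective nature, which extends to their relationships with cats.

9. **Newfoundland**: These gentle giants are known for their calm demeanor and usually get along well with other animals.

10. **Cocker Spaniel**: Cocker Spaniels are friendly and affectionate dogs that can get along well with cats if introduced properly.

When introducing a dog to a cat, it is important to go slowly and supervise their interactions to make sure a positive relationship develops. Each dog's personality can differ, so individual temperament is key to determining compatibility.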

# Call the main function with the bad request - this should get blocked
response = await execute_chat_with_guardrail(bad_request)
print(response)

I can only talk about cats and dogs, the best animals that ever lived.

The input guardrail appears to work: the first question was allowed through, while the second was blocked for being off topic.
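
One caveat about the code above: it calls the synchronous `openai.chat.completions.create` inside `async def` functions, so each call blocks the event loop and the two tasks in effect run one after the other. With a non-blocking call (for example the library's `AsyncOpenAI` client), the same early-cancel pattern can be written more directly. The sketch below shows only the control flow; `fake_guardrail`, `fake_chat`, and `guarded_chat` are hypothetical stand-ins that simulate latency with `asyncio.sleep` and make no API calls.

```python
import asyncio


async def fake_guardrail(user_request: str) -> str:
    # Hypothetical stand-in for the topical guardrail (simulated latency)
    await asyncio.sleep(0.01)
    on_topic = any(word in user_request.lower() for word in ("cat", "dog"))
    return "allowed" if on_topic else "not_allowed"


async def fake_chat(user_request: str) -> str:
    # Hypothetical stand-in for the slower chat completion
    await asyncio.sleep(0.05)
    return f"Answer to: {user_request}"


async def guarded_chat(user_request: str) -> str:
    # Start both calls so that they overlap
    guardrail_task = asyncio.create_task(fake_guardrail(user_request))
    chat_task = asyncio.create_task(fake_chat(user_request))

    # The verdict decides everything, so wait for it first
    verdict = (await guardrail_task).strip().lower()
    if verdict == "not_allowed":
        chat_task.cancel()  # drop the chat call that is still in flight
        return "I can only talk about cats and dogs."
    return await chat_task
```

Normalizing the verdict with `strip().lower()` is an added precaution, since an exact comparison against "not_allowed" fails if the model adds whitespace or capitalization.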

Output guardrails

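Output guardrails can take many forms. Some of the most common are:

Hallucination/fact checking: use a body of ground-truth information to block hallucinated responses.
Moderation: apply company guidelines to the LLM's output, and block or rewrite the response if it violates them.
Syntax checks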
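
The article does not elaborate on syntax checks. One plausible reading, taken here as an assumption, is verifying that a structured response can actually be parsed before it is passed on. The sketch below checks that a response is a JSON object containing a set of required keys; `syntax_guardrail` and the key names are hypothetical.

```python
import json
from typing import Optional


def syntax_guardrail(raw_output: str, required_keys: set) -> Optional[dict]:
    """Return the parsed object if it is valid, otherwise None."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(parsed, dict):
        return None  # valid JSON, but not an object
    if not required_keys.issubset(parsed):
        return None  # an expected field is missing
    return parsed
```

A caller could treat `None` as a signal to retry the request or to return a fallback message.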

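The example below implements a moderation guardrail. It uses a version of the G-Eval evaluation method to score an LLM response for the presence of unwanted content. To do this, we build an extensible content moderation framework:

We provide criteria that state clearly what the content should and should not contain.
We give the LLM step-by-step instructions for scoring the content.
The LLM returns a discrete score from 1 to 5.

Setting the threshold
The output guardrail scores the LLM's response and blocks anything that scores 3 or higher.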
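
The article's code converts the model's reply with a bare `int(...)`, which raises `ValueError` if the reply contains anything besides the number. A more defensive version of the threshold check might look like the sketch below. `parse_score`, `should_block`, and `BLOCK_THRESHOLD` are hypothetical names, and treating an unparsable reply as a block (failing closed) is a design assumption, not something the article specifies.

```python
import re
from typing import Optional

BLOCK_THRESHOLD = 3  # block anything scoring 3 or higher, as stated above


def parse_score(raw: str) -> Optional[int]:
    # Take the first standalone digit between 1 and 5, if there is one
    match = re.search(r"\b[1-5]\b", raw)
    return int(match.group()) if match else None


def should_block(raw: str, threshold: int = BLOCK_THRESHOLD) -> bool:
    score = parse_score(raw)
    # Fail closed: a reply with no usable score is treated as a block
    return score is None or score >= threshold
```

Failing closed trades a few unnecessary refusals for the guarantee that a malformed moderation reply never lets content through unchecked.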

#step1
domain = "animal breed recommendation"

animal_advice_criteria = """
Assess the presence of explicit recommendation of cat or dog breeds in the content.
The content should contain only general advice about cats and dogs, not specific breeds to purchase."""

animal_advice_steps = """
1. Read the content and the criteria carefully.
2. Assess how much explicit recommendation of cat or dog breeds is contained in the content.
3. Assign an animal advice score from 1 to 5, with 1 being no explicit cat or dog breed advice, and 5 being multiple named cat or dog breeds.
"""

moderation_system_prompt = """
You are a moderation assistant. Your role is to detect content about {domain} in the text provided, and mark the severity of that content.

## {domain}

### Criteria

{scoring_criteria}

### Instructions

{scoring_steps}

### Content

{content}

### Evaluation (score only!)
"""

# step2
async def moderation_guardrail(chat_response):
    print("Checking moderation guardrail")
    mod_messages = [
        {"role": "user", "content": moderation_system_prompt.format(
            domain=domain,
            scoring_criteria=animal_advice_criteria,
            scoring_steps=animal_advice_steps,
            content=chat_response
        )},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=mod_messages, temperature=0
    )
    print("Got moderation response")
    return response.choices[0].message.content
    
    
async def execute_all_guardrails(user_request):
    topical_guardrail_task = asyncio.create_task(topical_guardrail(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        done, _ = await asyncio.wait(
            [topical_guardrail_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )
        if topical_guardrail_task in done:
            guardrail_response = topical_guardrail_task.result()
            if guardrail_response == "not_allowed":
                chat_task.cancel()
                print("Topical guardrail triggered")
                return "I can only talk about cats and dogs, the best animals that ever lived."
            elif chat_task in done:
                chat_response = chat_task.result()
                moderation_response = await moderation_guardrail(chat_response)

                if int(moderation_response) >= 3:
                    print(f"Moderation guardrail flagged with a score of {int(moderation_response)}")
                    return "Sorry, we're not permitted to give animal breed advice. I can help you with any general queries you might have."

                else:
                    print('Passed moderation')
                    return chat_response
        else:
            await asyncio.sleep(0.1)  # sleep for a bit before checking the tasks again
#step3
# Adding a request that should pass both our topical guardrail and our moderation guardrail
great_request = 'What is some advice you can give to a new dog owner?'

#step4
tests = [good_request,bad_request,great_request]

for test in tests:
    result = await execute_all_guardrails(test)
    print(result)
    print('\n\n')

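Got moderation response
Moderation guardrail flagged with a score of 5
Sorry, we're not permitted to give animal breed advice. I can help you with any general queries you might have.


Got LLM response
Topical guardrail triggered
I can only talk about cats and dogs, the best animals that ever lived.


Got moderation response
Moderation guardrail flagged with a score of 3
Sorry, we're not permitted to give animal breed advice. I can help you with any general queries you might have.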
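
The test loop above uses `await` at the top level, which works in a Jupyter notebook but is a syntax error in an ordinary Python script. A minimal sketch of a script-friendly driver follows; `execute_all_guardrails_stub` is a hypothetical placeholder for the real `execute_all_guardrails`, so the block runs without any API access.

```python
import asyncio


async def execute_all_guardrails_stub(user_request: str) -> str:
    # Hypothetical placeholder for execute_all_guardrails
    await asyncio.sleep(0)
    return f"handled: {user_request}"


async def main(tests):
    # Run the requests one at a time and collect the results
    results = []
    for test in tests:
        results.append(await execute_all_guardrails_stub(test))
    return results


if __name__ == "__main__":
    # asyncio.run creates the event loop that a notebook provides implicitly
    for result in asyncio.run(main(["first request", "second request"])):
        print(result)
```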

Appendix: guardrail development frameworks
NeMo Guardrails
Guardrails AI