[Embracing AI] How to Evaluate the Quality of LLM-Generated Text?

Latest recommended article published on 2025-03-20 06:30:00

奔跑草-

Category: Artificial Intelligence. Tags: Artificial Intelligence, LLM

Article link: https://blog.csdn.net/u010690311/article/details/143786112

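The quality of generated text can be evaluated from several angles: automatic metrics, human evaluation, and some more advanced methods. The guide below covers each of them in more detail: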

1. Automatic Evaluation

1.1 Text Similarity Metrics

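BLEU (Bilingual Evaluation Understudy):
- Use: mainly machine translation and other text generation tasks.
- Computation: based on n-gram overlap between the candidate and the reference, typically from 1-grams up to 4-grams.
- Pros: simple to compute and easy to implement.
- Cons: looks only at n-gram overlap and ignores meaning and grammatical correctness.
- Implementation:
```
from nltk.translate.bleu_score import sentence_bleu

# References are a list of token lists; the candidate is a single token list.
reference = ["this is a test".split()]
candidate = "this is a test".split()
bleu_score = sentence_bleu(reference, candidate)
print(f"BLEU Score: {bleu_score}")
```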
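
To make the n-gram overlap idea concrete, here is a minimal pure-Python sketch of the clipped (modified) n-gram precision that BLEU is built on. The helper names `ngrams` and `modified_precision` are illustrative and not part of NLTK, and the sketch deliberately leaves out the brevity penalty and the geometric mean over n that the full metric applies.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


reference = "the cat is on the mat".split()
candidate = "the the the cat mat".split()
for n in (1, 2):
    print(f"{n}-gram precision: {modified_precision(reference, candidate, n):.2f}")
```

Repeating "the" three times earns credit only twice because the reference contains it twice, which is why the clipping step matters.
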
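ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
- Use: mainly text summarization tasks.
- Computation: comes in several variants, including ROUGE-N, ROUGE-L, and ROUGE-S.
- Pros: takes n-gram recall and the longest common subsequence into account.
- Cons: still mostly about word overlap and word order, not meaning.
- Implementation:
```
# Uses the `rouge` package (pip install rouge).
from rouge import Rouge

reference = "This is a test."
candidate = "This is a test."

rouge = Rouge()
# get_scores takes the hypothesis first, then the reference.
scores = rouge.get_scores(candidate, reference)
print(f"ROUGE Scores: {scores}")
```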
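
The longest-common-subsequence idea behind ROUGE-L can be sketched in plain Python as follows. This is an illustration under simplifying assumptions: `lcs_length` and `rouge_l` are hypothetical helper names, tokens are split on whitespace, and the F-score is a plain F1, which may differ from the weighting a given ROUGE library uses.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(reference, candidate):
    """Return (recall, precision, f1) based on the LCS of two token lists."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    recall = lcs / len(reference)
    precision = lcs / len(candidate)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


reference = "the cat sat on the mat".split()
candidate = "the cat lay on a mat".split()
recall, precision, f1 = rouge_l(reference, candidate)
print(f"ROUGE-L recall={recall:.2f} precision={precision:.2f} f1={f1:.2f}")
```

Unlike n-gram matching, the subsequence does not have to be contiguous, so "the cat ... on ... mat" still counts even though other words sit in between.
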
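METEOR (Metric for Evaluation of Translation with Explicit ORdering):
- Use: mainly machine translation tasks.
- Computation: combines several kinds of matches, including exact word matches, synonym matches, and matches between inflected forms of the same word.
- Pros: captures more semantic information than pure n-gram overlap.
- Cons: more expensive to compute.
- Implementation:
```
from nltk.translate.meteor_score import meteor_score

# Requires the WordNet data: run nltk.download("wordnet") once beforehand.
# Recent NLTK versions expect pre-tokenized input (lists of tokens).
reference = "this is a test".split()
candidate = "this is a test".split()
meteor_score_value = meteor_score([reference], candidate)
print(f"METEOR Score: {meteor_score_value}")
```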
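CIDEr (Consensus-based Image Description Evaluation):
- Use: mainly image captioning tasks.
- Computation: n-gram overlap weighted by TF-IDF.
- Pros: accounts for how informative each word is.
- Cons: designed for image captioning, so it does not suit every text generation task.
- Implementation:
```
# Assumes the pycocoevalcap package (pip install pycocoevalcap).
from pycocoevalcap.cider.cider import Cider

# Both arguments are dicts keyed by sample id; each value is a list of strings,
# and each candidate list must contain exactly one string.
references = {0: ["this is a test"], 1: ["another test"]}
candidates = {0: ["this is a test"], 1: ["another test"]}
cider = Cider()
corpus_score, per_sample_scores = cider.compute_score(references, candidates)
print(f"CIDEr Score: {corpus_score}")
```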

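1.2 Language Model Scores

Perplexity:
- Use: measures how uncertain a language model is about the generated text.
- Computation: derived from the probabilities the language model assigns to each token.
- Pros: reflects how well the model predicts the text.
- Cons: requires a pre-trained language model.
- Implementation: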

```
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

text = "This is a test."
input_ids = tokenizer.encode(text, return_tensors="pt")
with torch.no_grad():
    # With labels supplied, the first output is the mean cross-entropy loss per token.
    loss = model(input_ids, labels=input_ids)[0]
# Perplexity is the exponential of that mean loss.
perplexity = torch.exp(loss)
print(f"Perplexity: {perplexity.item()}")
```
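
The relationship between token probabilities and perplexity can be seen without any model at all. The sketch below assumes we already have the probability the model assigned to each token; `perplexity_from_probs` is an illustrative helper, not a library function, and the probability lists are made-up toy values.

```python
import math


def perplexity_from_probs(token_probs):
    """Perplexity is exp of the average negative log-probability per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


confident = [0.5, 0.5, 0.5, 0.5]  # model gives each token probability 0.5
uncertain = [0.1, 0.1, 0.1, 0.1]  # model gives each token probability 0.1
print(f"Confident model: {perplexity_from_probs(confident):.2f}")
print(f"Uncertain model: {perplexity_from_probs(uncertain):.2f}")
```

A model that assigns every token probability 0.5 behaves as if it were choosing between 2 equally likely options at each step, and one that assigns 0.1 as if it were choosing between 10, so lower perplexity means the text is less surprising to the model.
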

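1.3 Consistency Check

Self-BLEU:
- Use: measures how similar a model's generated texts are to one another.
- Computation: the BLEU score of each generated text against the other generated texts.
- Pros: reflects the diversity of the generated texts; a lower Self-BLEU means more diverse output.
- Cons: expensive to compute, since every text is compared against all the others.
- Implementation: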

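```
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

generated_texts = ["this is a test", "another test"]
tokens = [text.split() for text in generated_texts]
smooth = SmoothingFunction().method1  # avoids zero scores on short texts
# Score each text against all the others as references, then average.
scores = [sentence_bleu(tokens[:i] + tokens[i + 1:], hyp, smoothing_function=smooth)
          for i, hyp in enumerate(tokens)]
self_bleu = sum(scores) / len(scores)
print(f"Self-BLEU: {self_bleu}")
```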

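2. Human Evaluation

2.1 Scoring Criteria

Fluency:
- Whether the generated text reads smoothly and naturally.
- Scale: 1-5, where 1 means very disfluent and 5 means very fluent.
Coherence:
- Whether the generated text is logical and internally consistent.
- Scale: 1-5, where 1 means very incoherent and 5 means very coherent.
Relevance:
- Whether the generated text is relevant to the given prompt or context.
- Scale: 1-5, where 1 means completely irrelevant and 5 means highly relevant.
Creativity:
- Whether the generated text is creative and novel.
- Scale: 1-5, where 1 means no creativity at all and 5 means highly creative.
Accuracy:
- Whether the generated text contains correct information.
- Scale: 1-5, where 1 means completely inaccurate and 5 means highly accurate.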
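
One way to keep ratings consistent across reviewers is to encode the rubric and validate every rating against it. The following is a minimal sketch: the English criterion keys, `validate_rating`, and `average_rating` are all hypothetical names chosen for illustration, and the unweighted mean is an assumption rather than something the rubric prescribes.

```python
CRITERIA = ("fluency", "coherence", "relevance", "creativity", "accuracy")


def validate_rating(rating):
    """Check that a rating covers every criterion with an integer from 1 to 5."""
    missing = [c for c in CRITERIA if c not in rating]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    for criterion in CRITERIA:
        score = rating[criterion]
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{criterion} must be an integer from 1 to 5, got {score!r}")
    return rating


def average_rating(rating):
    """Unweighted mean over the five criteria."""
    return sum(rating[c] for c in CRITERIA) / len(CRITERIA)


rating = validate_rating(
    {"fluency": 5, "coherence": 4, "relevance": 4, "creativity": 3, "accuracy": 4}
)
print(f"Average: {average_rating(rating)}")
```
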

2.2 Evaluation Methods

Direct scoring:

Ask several reviewers to score the generated text, typically on a 1-5 or 1-10 scale.

Implementation:

```
def human_evaluation(texts):
    # Collect one score per text interactively from a single reviewer.
    scores = []
    for text in texts:
        score = float(input(f"Rate the following text from 1 to 5:\n{text}\n"))
        scores.append(score)
    return scores

generated_texts = ["This is a test.", "Another test sentence."]
scores = human_evaluation(generated_texts)
print(f"Human Scores: {scores}")
```
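
Since direct scoring involves several reviewers, their scores need to be aggregated per text. Below is a small sketch of one way to do that; `aggregate_ratings` and the sample scores are illustrative assumptions, and the standard deviation is included only as a rough signal of how much the reviewers disagree.

```python
from statistics import mean, stdev


def aggregate_ratings(ratings_by_rater):
    """ratings_by_rater: one list of scores per rater, aligned by text index."""
    per_text = list(zip(*ratings_by_rater))  # regroup the scores by text
    return [
        {"mean": mean(scores), "stdev": stdev(scores) if len(scores) > 1 else 0.0}
        for scores in per_text
    ]


ratings_by_rater = [
    [4, 2],  # rater A's scores for text 0 and text 1
    [5, 3],  # rater B
    [3, 1],  # rater C
]
summaries = aggregate_ratings(ratings_by_rater)
for i, summary in enumerate(summaries):
    print(f"Text {i}: mean={summary['mean']:.2f}, stdev={summary['stdev']:.2f}")
```
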

Preference test:

Have reviewers compare several generated texts and pick the one they consider best.

Implementation:

```
def preference_test(texts):
    # Ask the reviewer to compare each adjacent pair of texts (answer 1 or 2).
    preferences = []
    for i in range(len(texts) - 1):
        preference = int(input(f"Which one is better?\n1. {texts[i]}\n2. {texts[i + 1]}\n"))
        preferences.append(preference)
    return preferences

generated_texts = ["This is a test.", "Another test sentence."]
preferences = preference_test(generated_texts)
print(f"Preferences: {preferences}")
```
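
Once pairwise judgments are collected, a common way to summarize them is a win rate per candidate. The sketch below assumes each judgment is recorded as an `(option_a, option_b, winner)` tuple; that record format, the `win_rates` helper, and the model names are hypothetical.

```python
from collections import Counter


def win_rates(judgments):
    """judgments: list of (option_a, option_b, winner) tuples from reviewers."""
    wins = Counter()
    appearances = Counter()
    for option_a, option_b, winner in judgments:
        if winner not in (option_a, option_b):
            raise ValueError(f"winner {winner!r} is not one of the compared options")
        appearances[option_a] += 1
        appearances[option_b] += 1
        wins[winner] += 1
    # Fraction of comparisons each option won, out of those it took part in.
    return {option: wins[option] / appearances[option] for option in appearances}


judgments = [
    ("model_a", "model_b", "model_a"),
    ("model_a", "model_b", "model_a"),
    ("model_a", "model_b", "model_b"),
    ("model_a", "model_c", "model_c"),
]
rates = win_rates(judgments)
print(rates)
```
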

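Task completion:

Assess whether the generated text accomplishes a specific task, such as answering a question or writing an article.

Implementation: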

```
def task_completion(texts, tasks):
    # Ask the reviewer whether each text accomplishes its paired task (1 = yes, 0 = no).
    completions = []
    for text, task in zip(texts, tasks):
        completion = int(input(f"Does the text '{text}' complete the task '{task}'? (1 for yes, 0 for no)\n"))
        completions.append(completion)
    return completions

generated_texts = ["This is a test.", "Another test sentence."]
tasks = ["Write a short sentence about a test.", "Write a short sentence about another topic."]
completions = task_completion(generated_texts, tasks)
print(f"Task Completions: {completions}")
```

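3. Combined Evaluation

3.1 Combining Multiple Metrics

Combining automatic and human evaluation results:

Take a weighted average of the automatic scores and the human ratings to obtain a combined score. The two should be on a comparable scale before they are averaged.

Implementation: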

```
def combine_scores(automatic_scores, human_scores, weights=(0.5, 0.5)):
    # Rescale the 1-5 human ratings to 0-1 so both inputs share the same scale.
    normalized_human = [(score - 1) / 4 for score in human_scores]
    return [
        weights[0] * auto + weights[1] * human
        for auto, human in zip(automatic_scores, normalized_human)
    ]

automatic_scores = [0.8, 0.7]
human_scores = [4.5, 4.0]
combined_scores = combine_scores(automatic_scores, human_scores)
print(f"Combined Scores: {combined_scores}")
```

Multi-model comparison:

Compare the outputs of different models and select the best-performing one.

Implementation:

```
def compare_models(model1_scores, model2_scores):
    # Compare the two models by their mean score.
    model1_avg = sum(model1_scores) / len(model1_scores)
    model2_avg = sum(model2_scores) / len(model2_scores)
    if model1_avg > model2_avg:
        return "Model 1 is better"
    if model2_avg > model1_avg:
        return "Model 2 is better"
    return "Both models score the same on average"

model1_scores = [0.8, 0.7]
model2_scores = [0.7, 0.6]
result = compare_models(model1_scores, model2_scores)
print(f"Comparison Result: {result}")
```
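
The same comparison extends naturally to more than two models. The sketch below ranks any number of models by mean score; `rank_models`, the model names, and the scores are illustrative assumptions, and a mean-based ranking says nothing about whether the differences are statistically meaningful.

```python
from statistics import mean


def rank_models(scores_by_model):
    """Return (model_name, mean_score) pairs sorted from best to worst."""
    averages = {name: mean(scores) for name, scores in scores_by_model.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


scores_by_model = {
    "model_1": [0.8, 0.7],
    "model_2": [0.7, 0.6],
    "model_3": [0.9, 0.5],
}
ranking = rank_models(scores_by_model)
for name, avg in ranking:
    print(f"{name}: {avg:.2f}")
```
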

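4. Advanced Evaluation Methods

4.1 Semantic Similarity

Computing semantic similarity with a pre-trained model such as BERT:

Use: measures how close the generated text is in meaning to the reference text.

Implementation: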

```
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

reference = "This is a test."
candidate = "This is a test."

# Encode both sentences into dense embedding vectors.
reference_embedding = model.encode(reference, convert_to_tensor=True)
candidate_embedding = model.encode(candidate, convert_to_tensor=True)

# Cosine similarity between the two embeddings.
similarity = util.cos_sim(reference_embedding, candidate_embedding).item()
print(f"Semantic Similarity: {similarity}")
```
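
The similarity score itself is just the cosine of the angle between the two embedding vectors. Here is a minimal pure-Python sketch of that computation; `cosine_similarity` is an illustrative helper and the short vectors stand in for real sentence embeddings, which have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(f"Same direction: {same_direction:.2f}")
print(f"Orthogonal: {orthogonal:.2f}")
```

Because only the direction of the vectors matters, scaling an embedding does not change the score.
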

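4.2 Grammar and Spelling Checks

Using a grammar and spelling checker:

Use: measures the grammatical and spelling correctness of the generated text.

Implementation: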

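```
import language_tool_python

# Sends the text to the public LanguageTool API, so network access is required.
tool = language_tool_python.LanguageToolPublicAPI('en-US')

text = "This is a test."
matches = tool.check(text)  # one entry per detected issue
error_count = len(matches)
print(f"Grammar and Spelling Errors: {error_count}")
```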
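
A raw error count grows with text length, so comparing texts of different sizes is easier with a normalized rate. The sketch below is one possible normalization, not something the checker provides: `errors_per_100_words` is a hypothetical helper and words are counted by whitespace splitting.

```python
def errors_per_100_words(text, error_count):
    """Normalize an error count by text length (whitespace-separated words)."""
    words = text.split()
    if not words:
        return 0.0
    return 100 * error_count / len(words)


short_text = "one two three four"
long_text = " ".join(["word"] * 200)
print(errors_per_100_words(short_text, 1))
print(errors_per_100_words(long_text, 1))
```
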

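5. Case Study

5.1 Generating a News Article

Task: generate a news article about developments in technology.
Evaluation metrics:
- Automatic evaluation: BLEU, ROUGE, Perplexity
- Human evaluation: fluency, coherence, relevance, creativity, accuracy

Implementation: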

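```
import torch
from nltk.translate.bleu_score import sentence_bleu
from rouge import Rouge
from transformers import (GPT2LMHeadModel, GPT2Tokenizer,
                          T5ForConditionalGeneration, T5Tokenizer)

model_name = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

prompt = "Write a news article about recent advancements in technology."
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids, max_length=500, num_return_sequences=1)
generated_article = tokenizer.decode(output[0], skip_special_tokens=True)

print(f"Generated Article: {generated_article}")


def compute_perplexity(text, lm_name="gpt2"):
    # Same GPT-2 perplexity computation as in section 1.2, wrapped in a function.
    lm_tokenizer = GPT2Tokenizer.from_pretrained(lm_name)
    lm = GPT2LMHeadModel.from_pretrained(lm_name)
    ids = lm_tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        loss = lm(ids, labels=ids)[0]
    return torch.exp(loss).item()


# Automatic evaluation
reference_article = "Recent advancements in technology have revolutionized various industries..."
bleu_score = sentence_bleu([reference_article.split()], generated_article.split())
rouge_scores = Rouge().get_scores(generated_article, reference_article)
perplexity = compute_perplexity(generated_article)

print(f"BLEU Score: {bleu_score}")
print(f"ROUGE Scores: {rouge_scores}")
print(f"Perplexity: {perplexity}")

# Human evaluation (human_evaluation is the function defined in section 2.2)
human_scores = human_evaluation([generated_article])
print(f"Human Scores: {human_scores}")
```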