BLEU and ROUGE explained: common evaluation metrics for language models, with examples and code

小王不叫小王叭

Category: LLM. Tags: language models, artificial intelligence, natural language processing

Article link: https://blog.csdn.net/weixin_45573296/article/details/141333719

Common evaluation metrics for language models

Accuracy, precision, recall

BLEU score

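BLEU score: a metric for evaluating the quality of text translated from one language into another. Its value lies in the range [0, 1].

BLEU can be split into several metrics according to the n-gram order, where an n-gram is a sequence of n consecutive words. In practice, n = 1 to 4 is typically used, and the n-gram precisions are then combined with a weighted (geometric) average.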

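The calculation is illustrated below with an example (basic steps):

1. Extract the n-grams of the candidate sentence and of the reference sentence, count how many of them match, and compute the match ratio.

2. Formula: number of n-grams matched between candidate and reference / number of n-grams in the candidate.

Suppose the machine translation output (candidate) and one reference translation are as follows: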

candidate: It is a nice day today
reference: today is a nice day

Matching with 1-grams:

candidate: {it, is, a, nice, day, today}
reference: {today, is, a, nice, day}
Result: {today, is, a, nice, day} match, so the match ratio is 5/6

Matching with 3-grams:

candidate: {it is a, is a nice, a nice day, nice day today}
reference: {today is a, is a nice, a nice day}
Result: {is a nice, a nice day} match, so the match ratio is 2/4
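The two match ratios above can be reproduced with a few lines of Python. This is a minimal sketch, not the reference BLEU implementation: the helper names `ngrams` and `simple_precision` are made up for illustration, tokens are lowercased and split on whitespace, and no clipping is applied yet.

```python
def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_precision(candidate, reference, n):
    """Count candidate n-grams that also occur in the reference (no clipping).

    Returns (matched, total) so the ratio can be read off as a fraction.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = set(ngrams(reference.lower().split(), n))
    matched = sum(1 for gram in cand if gram in ref)
    return matched, len(cand)

candidate = "It is a nice day today"
reference = "today is a nice day"
print(simple_precision(candidate, reference, 1))  # (5, 6)
print(simple_precision(candidate, reference, 3))  # (2, 4)
```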

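The examples above show that the more n-grams match, the higher the BLEU score and the better the candidate sentence. However, extreme cases like the following can also occur:

Extreme example:

candidate: the the the the
reference: The cat is standing on the ground
With 1-gram matching as defined above, every candidate word occurs in the reference, so the match ratio is 1, which is clearly unreasonable.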

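To correct this, first count the maximum number of times each word appears in any single reference sentence, then use that maximum to clip the number of times the same (distinct) word appears in the candidate sentence, as in the following formula:
$count_k = \min(C_k, S_k)$
Here k indexes the k-th distinct word in the candidate sentence, $C_k$ is the number of times that word appears in the candidate, and $S_k$ is the number of times it appears in the reference text.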
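Applied to the extreme example, clipping can be sketched as follows. The function name `clipped_unigram_precision` is hypothetical, and the sketch assumes case-insensitive matching (so "The" and "the" count as the same word) with whitespace tokenization.

```python
from collections import Counter

def clipped_unigram_precision(candidate, references):
    """Return (clipped matches, total candidate words) for 1-grams."""
    cand_counts = Counter(candidate.lower().split())  # C_k for each distinct word
    # S_k: the largest count of each word in any single reference
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref.lower().split()).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    # count_k = min(C_k, S_k), summed over the distinct candidate words
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in cand_counts.items())
    return clipped, sum(cand_counts.values())

result = clipped_unigram_precision(
    "the the the the", ["The cat is standing on the ground"]
)
print(result)  # (2, 4)
```

Under these assumptions "the" occurs 4 times in the candidate and 2 times in the reference, so the clipped match ratio drops from 1 to 2/4.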

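# BLEU: Python implementation
# Install NLTK first: pip install nltk
from nltk.translate.bleu_score import sentence_bleu

def cumulative_bleu(references, candidate):
    """Return cumulative BLEU-1 to BLEU-4 for one tokenized candidate."""
    # Each weight tuple spreads the weight evenly over the n-gram orders used
    # (0.33 approximates 1/3, as in the original post).
    bleu_1_gram = sentence_bleu(references, candidate, weights=(1, 0, 0, 0))
    bleu_2_gram = sentence_bleu(references, candidate, weights=(0.5, 0.5, 0, 0))
    bleu_3_gram = sentence_bleu(references, candidate, weights=(0.33, 0.33, 0.33, 0))
    bleu_4_gram = sentence_bleu(references, candidate, weights=(0.25, 0.25, 0.25, 0.25))
    return bleu_1_gram, bleu_2_gram, bleu_3_gram, bleu_4_gram

# Generated (candidate) text, already tokenized
candidate_text = ["This", "is", "some", "generated", "text"]
# List of tokenized reference texts
reference_texts = [["This", "is", "a", "reference", "text"],
                   ["This", "is", "another", "reference", "text"]]
# Compute the BLEU scores
c_bleu = cumulative_bleu(reference_texts, candidate_text)

# Print the result
print("The Bleu score is:", c_bleu)
# Output reported in the original post (rounded):
# The Bleu score is: (0.6, 0.387, 1.5945e-102, 9.283e-155)
# The 3-gram and 4-gram scores are effectively zero because the candidate
# shares no 3-gram or 4-gram with either reference and no smoothing is used.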

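BLEU formula: $\mathrm{BLEU} = BP \cdot p_1^{w_1} \cdot p_2^{w_2} \cdot p_3^{w_3} \cdot p_4^{w_4}$, where $p_n$ is the clipped n-gram precision, $w_n$ is its weight, and $BP$ is the brevity penalty, which equals 1 when the candidate is at least as long as the reference.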
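As a check on the formula, the 2-gram score from the NLTK example can be recomputed by hand. The candidate shares 3 of its 5 words with the references (p1 = 3/5) and 1 of its 4 bigrams, "This is" (p2 = 1/4). The sketch below is an illustration only: `combine_bleu` is a made-up helper, it uses the standard brevity penalty exp(1 - r/c) for candidates no longer than the reference, and it returns 0 when a precision is 0, which differs from the tiny non-zero values NLTK printed above.

```python
import math

def combine_bleu(precisions, weights, cand_len, ref_len):
    """Weighted geometric mean of n-gram precisions times the brevity penalty."""
    # Brevity penalty: 1 for long enough candidates, exp(1 - r/c) otherwise
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    used = [(p, w) for p, w in zip(precisions, weights) if w > 0]
    if any(p == 0 for p, _ in used):
        return 0.0  # a zero precision makes the geometric mean zero
    return bp * math.exp(sum(w * math.log(p) for p, w in used))

bleu_2 = combine_bleu([3 / 5, 1 / 4], [0.5, 0.5], cand_len=5, ref_len=5)
print(round(bleu_2, 3))  # 0.387
```

The result is sqrt(0.6 * 0.25), which agrees with the second value in the reported NLTK output.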

ROUGE

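ROUGE is an evaluation metric commonly used in machine translation, automatic summarization, question answering, and similar generation tasks. It compares the summary or answer produced by the model with reference answers (usually written by humans) and computes a score from the comparison.

ROUGE is very similar to BLEU: both measure how well the generated output matches the reference output. The difference is that ROUGE is based on recall, while BLEU puts more weight on precision. ROUGE comes in four variants: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S.
The calculation is illustrated below with an example (only ROUGE-N is covered here):
Basic step: ROUGE-N splits both the generated output and the reference output into n-grams and then computes recall.
Suppose the model-generated text (candidate) and one reference text are as follows: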

candidate: It is a nice day today
reference: today is a nice day

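Matching with ROUGE-1:

candidate: {it, is, a, nice, day, today}
reference: {today, is, a, nice, day}
Result: {today, is, a, nice, day} match, so the match ratio is 5/5 = 1. The generated text covers every word in the reference text, which by this measure indicates high quality.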
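The recall computation can be sketched directly in Python. This is a simplified illustration rather than the implementation of any ROUGE package: `rouge_n_recall` is a hypothetical helper that lowercases, splits on whitespace, and counts overlapping n-grams with clipping against a single reference.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also appear in the candidate."""
    def ngram_counts(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngram_counts(candidate)
    ref_counts = ngram_counts(reference)
    # Each reference n-gram is matched at most as often as it occurs in the candidate
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

candidate = "It is a nice day today"
reference = "today is a nice day"
print(rouge_n_recall(candidate, reference, n=1))  # 1.0
print(rouge_n_recall(candidate, reference, n=2))  # 0.75
```

Note the denominator: ROUGE-N divides by the number of n-grams in the reference, whereas the BLEU match ratio divides by the number in the candidate. With n = 2 the reference bigram "today is" is missing from the candidate, so recall falls to 3/4.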

Python implementation:

# Install the package first: pip install rouge
from rouge import Rouge

# Generated text
generated_text = "This is some generated text."
# List of reference texts
reference_texts = ["This is a reference text.", "This is another generated reference text."]
# Compute the ROUGE scores against the second reference
rouge = Rouge()
scores = rouge.get_scores(generated_text, reference_texts[1])
# Print the full result, then the individual ROUGE-1 values
print(scores)

print("ROUGE-1 precision:", scores[0]["rouge-1"]["p"])
print("ROUGE-1 recall:", scores[0]["rouge-1"]["r"])
print("ROUGE-1 F1 score:", scores[0]["rouge-1"]["f"])