LLM: Evaluation Metrics for Text Generation

Scc_hy

Last modified on 2023-05-16 21:40:08


First published on 2023-05-16 21:39:15

This is the author's original work; to repost it, please contact the author: hyscc1994@foxmail.com

Article link: https://blog.csdn.net/Scc_hy/article/details/130714447


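BLEU is a precision-based evaluation method for machine translation, computed from n-gram precision and a brevity penalty. ROUGE is recall-based: it measures how many of the reference words appear in the generated text, and comes in variants such as ROUGE-1, ROUGE-2 and ROUGE-L that match at different granularities. Both metrics are commonly used to assess the quality of generated text.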


1. BLEU (`precision-based metric`)

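BLEU measures precision: what fraction of the generated words also appear in the reference.
Problem 1:
- If the generation repeats a word that occurs in the reference, plain precision gives an inflated score.
- The BLEU authors' fix: credit a word at most as many times as it occurs in the reference.
  - example:
    - ref-"the cat is on the mat"
    - g-"the the the the the the"
  - $P_{vanilla}=\frac{6}{6}$ , $P_{mod}=\frac{2}{6}$
Fix for problem 1: clip
- The count of each n-gram is capped at the number of times it occurs in the reference sentence.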

$p_n=\frac{ \sum_{geSnt \in C}\sum_{n\text{-gram} \in geSnt} Count_{clip}(n\text{-gram}) }{ \sum_{geSnt \in C}\sum_{n\text{-gram} \in geSnt} Count(n\text{-gram}) }$
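The clipped precision can be sketched in a few lines of Python. This is a minimal illustration for a single sentence pair, assuming whitespace tokenization; the helper names `ngrams` and `clipped_precision` are hypothetical and do not come from any library.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(generated, reference, n=1):
    """Return (clipped match count, number of n-grams in the generation)."""
    gen_counts = Counter(ngrams(generated.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    # Cap each generated n-gram count at its count in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in gen_counts.items())
    total = sum(gen_counts.values())
    return clipped, total

# The repeated-word example: vanilla precision would be 6/6, clipped is 2/6.
print(clipped_precision("the the the the the the", "the cat is on the mat"))  # (2, 6)
```

Returning the numerator and denominator separately keeps the result easy to compare with fractions such as $\frac{2}{6}$.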

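Problem 2:
- Because this is a precision score, it clearly favors short generations and underrates longer ones.
Fix for problem 2: brevity penalty
- $BP=\min(1, e^{1 - \frac{l_{ref}}{l_{gen}}})$ : equals 1 when the generation is at least as long as the reference, and lies in $(0, 1)$ when it is shorter.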
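As a quick check of the brevity penalty, here is a small sketch. The function name `brevity_penalty` is hypothetical, and returning 0 for an empty generation is an assumption made only to avoid a division by zero.

```python
import math

def brevity_penalty(ref_len, gen_len):
    """min(1, exp(1 - ref_len / gen_len)) for one sentence pair."""
    if gen_len == 0:
        return 0.0  # assumption: an empty generation gets no credit
    return min(1.0, math.exp(1 - ref_len / gen_len))

print(brevity_penalty(6, 8))  # generation longer than reference: 1.0
print(brevity_penalty(6, 3))  # generation half as long: exp(-1), about 0.368
```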

Final formula:

$BLEU\text{-}N = BP \times (\prod_{n=1}^N p_n)^{1/N}$

Example: computing BLEU-4

ref-"the cat sat on the mat"
g-"the cat the cat is on the mat"
BP: $BP=\min(1, e^{1-6/8})=1$
n=1
- 1-gram: org:{"the", "cat", "sat", "on", "mat"} ge:{"the", "cat", "is", "on", "mat"}
- clip: $count_{clip}("the") = 2, count_{clip}("cat") = 1, count_{clip}("is") = 0, count_{clip}("on") = 1, count_{clip}("mat") = 1$ for each 1-gram in $geSnt$
- $p_1 = \frac{5}{8}$
n=2
- 2-gram: org:{"the cat", "cat sat", "sat on", "on the", "the mat"} ge:{"the cat", "cat the", "cat is", "is on", "on the", "the mat"}
- $p_2 = \frac{3}{7}$ (the generation has 7 bigrams because "the cat" occurs twice; the clipped matches are "the cat", "on the" and "the mat")
n=3
- 3-gram: org:{"the cat sat", "cat sat on", "sat on the", "on the mat"} ge:{"the cat the", "cat the cat", "the cat is", "cat is on", "is on the", "on the mat"}
- $p_3 = \frac{1}{6}$
n=4
- 4-gram: org:{"the cat sat on", "cat sat on the", "sat on the mat"} ge:{"the cat the cat", "cat the cat is", "the cat is on", "cat is on the", "is on the mat"}
- $p_4 = \frac{0}{5}$
BLEU-4: $1 \times (\frac{5}{8}\cdot\frac{3}{7}\cdot\frac{1}{6}\cdot\frac{0}{5})^{1/4}=0$
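The worked example can be reproduced with a short from-scratch sketch. It assumes a single sentence pair, whitespace tokenization and no smoothing, so it is not a substitute for a real BLEU implementation; `ngram_counts` and `bleu` are hypothetical names.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams (as tuples) of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(generated, reference, max_n=4):
    """Return (score, clipped counts, totals, brevity penalty) for one pair."""
    gen, ref = generated.split(), reference.split()
    if not gen:
        return 0.0, [0] * max_n, [0] * max_n, 0.0
    counts, totals = [], []
    for n in range(1, max_n + 1):
        gen_c, ref_c = ngram_counts(gen, n), ngram_counts(ref, n)
        # Clipped matches: each n-gram counts at most as often as in the reference.
        counts.append(sum(min(c, ref_c[g]) for g, c in gen_c.items()))
        totals.append(sum(gen_c.values()))
    bp = min(1.0, math.exp(1 - len(ref) / len(gen)))
    if min(counts) == 0:
        score = 0.0  # any zero precision makes the geometric mean zero
    else:
        log_mean = sum(math.log(c / t) for c, t in zip(counts, totals)) / max_n
        score = bp * math.exp(log_mean)
    return score, counts, totals, bp

score, counts, totals, bp = bleu("the cat the cat is on the mat", "the cat sat on the mat")
print(counts, totals, bp, score)  # [5, 3, 1, 0] [8, 7, 6, 5] 1.0 0.0
```

Computing the geometric mean in log space is a common choice for numerical stability; with `max_n=2` the same pair gives $\sqrt{\frac{5}{8}\cdot\frac{3}{7}}$, roughly 0.518.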

1.1 Calling `sacrebleu` through Hugging Face `load_metric`

The library's counts and precisions match the hand calculation above.

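# Install sacrebleu first (notebook syntax; use `pip install sacrebleu` in a shell)
!pip install sacrebleu
# Note: load_metric is deprecated in newer `datasets` releases in favor of the `evaluate` package
from datasets import load_metric

bleu_metric = load_metric("sacrebleu")
# sacrebleu accepts several references per prediction, hence the list
bleu_metric.add(prediction="the cat the cat is on the mat", reference=["the cat sat on the mat"])
# "floor" smoothing with value 0 effectively disables smoothing, so the zero 4-gram precision gives score 0
results = bleu_metric.compute(smooth_method="floor", smooth_value=0)
results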
"""
{'score': 0.0,
 'counts': [5, 3, 1, 0],
 'totals': [8, 7, 6, 5],
 'precisions': [62.5, 42.857142857142854, 16.666666666666668, 0.0],
 'bp': 1.0,
 'sys_len': 8,
 'ref_len': 6}
"""

2. ROUGE (`recall-based metric`)

ROUGE measures recall: what fraction of the reference words appear in the generated text.

$ROUGE\text{-}N=\frac{ \sum_{orgSnt \in C}\sum_{n\text{-gram} \in orgSnt} Count_{match}(n\text{-gram}) }{ \sum_{orgSnt \in C}\sum_{n\text{-gram} \in orgSnt} Count(n\text{-gram}) }$

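ROUGE-L is a separate score based on the longest common subsequence (LCS) of the two texts; unlike a substring, a subsequence does not have to be contiguous. Here $m$ is the length of the reference $X$ and $n$ is the length of the generation $Y$.

$R_{LCS}=\frac{LCS(X,Y)}{m}; P_{LCS}=\frac{LCS(X,Y)}{n}$
$F_{LCS}=\frac{(1+\beta ^2)R_{LCS}P_{LCS}}{R_{LCS}+\beta ^2 P_{LCS}}, \beta=\frac{P_{LCS}}{R_{LCS}}$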
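To make the LCS-based score concrete, the sketch below computes it with the standard dynamic-programming recurrence. It assumes whitespace tokenization and uses $\beta = 1$ by default, in which case the F-measure reduces to the harmonic mean of precision and recall; the names `lcs_length` and `rouge_l` are hypothetical.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    # dp[i][j] holds the LCS length of x[:i] and y[:j].
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            if xi == yj:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(generated, reference, beta=1.0):
    """Return (precision, recall, F) based on the LCS of one sentence pair."""
    gen, ref = generated.split(), reference.split()
    lcs = lcs_length(ref, gen)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    recall, precision = lcs / len(ref), lcs / len(gen)
    # beta > 1 weights recall more heavily; beta = 1 is the harmonic mean.
    f = (1 + beta ** 2) * recall * precision / (recall + beta ** 2 * precision)
    return precision, recall, f

# LCS is "the cat on the mat" (5 tokens): precision 5/8, recall 5/6, F = 5/7.
print(rouge_l("the cat the cat is on the mat", "the cat sat on the mat"))
```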

Example: computing ROUGE-1

ref-"the cat sat on the mat"
g-"the cat the cat is on the mat"
1-gram: org:{"the", "cat", "sat", "on", "mat"} ge:{"the", "cat", "is", "on", "mat"}
$ROUGE\text{-}1^{r}=\frac{2+1+0+1+1}{6}=\frac{5}{6}$
$BLEU\text{-}1^{p}=\frac{min(3,2)+min(2,1)+0+1+1}{8}=\frac{5}{8}$
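The two fractions above share the same numerator (the overlapping n-gram count) and differ only in the denominator. A minimal sketch, assuming whitespace tokenization and a single reference, with the hypothetical helper name `rouge_n`:

```python
from collections import Counter

def rouge_n(generated, reference, n=1):
    """Return (precision, recall, F1) of n-gram overlap for one sentence pair."""
    def counts(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    gen_c, ref_c = counts(generated), counts(reference)
    # Multiset intersection keeps the minimum count of each shared n-gram.
    overlap = sum((gen_c & ref_c).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(gen_c.values())  # divide by generated n-grams
    recall = overlap / sum(ref_c.values())     # divide by reference n-grams
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gen, ref = "the cat the cat is on the mat", "the cat sat on the mat"
print(rouge_n(gen, ref, 1))  # precision 5/8, recall 5/6
print(rouge_n(gen, ref, 2))  # precision 3/7, recall 3/5
```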

2.1 Calling `rouge_score` through Hugging Face `load_metric`

The library's precision and recall match the hand calculation above.

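# Install rouge_score first (notebook syntax; use `pip install rouge_score` in a shell)
!pip install rouge_score
from datasets import load_metric

rouge_metric = load_metric("rouge")
# The rouge metric takes one reference string per prediction
rouge_metric.add(prediction="the cat the cat is on the mat", reference="the cat sat on the mat")
results = rouge_metric.compute()
# fmeasure is the harmonic mean of precision (5/8) and recall (5/6)
print(1/(0.5* 1/0.625 + 0.5* 1/0.8333333333333334))
results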
"""
0.7142857142857143
{'rouge1': AggregateScore(low=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), mid=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), high=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143)),
 'rouge2': AggregateScore(low=Score(precision=0.42857142857142855, recall=0.6, fmeasure=0.5), mid=Score(precision=0.42857142857142855, recall=0.6, fmeasure=0.5), high=Score(precision=0.42857142857142855, recall=0.6, fmeasure=0.5)),
 'rougeL': AggregateScore(low=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), mid=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), high=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143)),
 'rougeLsum': AggregateScore(low=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), mid=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143), high=Score(precision=0.625, recall=0.8333333333333334, fmeasure=0.7142857142857143))}

"""