Common Evaluation Metrics for Language Models (BLEU, ROUGE, PPL)

cd_JJGong

Last modified 2024-07-02 01:06:32

First published 2024-07-02 01:04:39

Article link: https://blog.csdn.net/J_J_Gong/article/details/140113239

Common evaluation metrics for language models

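1. Accuracy: the number of samples the model predicts correctly, as a fraction of all samples.

2. Precision: among the samples predicted as positive, the fraction that are actually positive.

3. Recall: among all samples that are actually positive, the fraction correctly predicted as positive.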
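The three definitions above can be sketched in a few lines of plain Python. This is only an illustration: the function name `classification_metrics`, the `positive` parameter, and the toy labels are assumptions made for the example, not something from the original post.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels.

    `positive` marks which label counts as the positive class
    (a hypothetical parameter chosen for this sketch).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when nothing is predicted/labelled positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels, made up for illustration.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

Here 4 of 6 predictions are correct, and 3 of the 4 predicted positives and 3 of the 4 actual positives are hits, so precision and recall are both 0.75.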

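4. BLEU score

BLEU is a metric for evaluating the quality of text translated from one language into another. It defines "quality" as the degree of agreement with human translations. Scores lie in [0, 1]; the closer to 1, the better the translation.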

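BLEU comes in several variants depending on the n-gram used, where an n-gram is a sequence of n consecutive words. In practice, the precisions for n = 1 to 4 are usually computed and then combined with a weighted (geometric) average.
Calculation: extract the n-grams of the model's predicted sentence and of the reference, count how many of them match, and compute the match rate.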
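For reference, the usual single-reference formulation of BLEU can be written as below. The notation is an assumption introduced here, not taken from the original post: $p_n$ is the clipped n-gram precision, $w_n$ are the weights (commonly $w_n = 1/N$ with $N = 4$), and $\ell_c$, $\ell_r$ are the candidate and reference lengths.

```latex
p_n = \frac{\sum_{g \in G_n(\text{cand})} \min\bigl(\mathrm{Count}_{\text{cand}}(g),\ \mathrm{Count}_{\text{ref}}(g)\bigr)}
           {\sum_{g \in G_n(\text{cand})} \mathrm{Count}_{\text{cand}}(g)}

\mathrm{BP} =
\begin{cases}
1 & \text{if } \ell_c > \ell_r \\
e^{\,1 - \ell_r / \ell_c} & \text{if } \ell_c \le \ell_r
\end{cases}

\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Bigl(\sum_{n=1}^{N} w_n \ln p_n\Bigr)
```

The weighted sum is taken over the logarithms of the precisions, so the combination is a weighted geometric mean, and the brevity penalty BP lowers the score of candidates shorter than the reference.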
Example
Suppose the machine translation output (candidate) and one reference translation (reference) are:

candidate: It is a nice day today
reference: today is a nice day

- 1-gram matching
  candidate: {It, is, a, nice, day, today}
  reference: {today, is, a, nice, day}
  Result: the match rate is 5/6
- 2-gram matching
  candidate: {It is, is a, a nice, nice day, day today}
  reference: {today is, is a, a nice, nice day}
  Result: the match rate is 3/5
- 3-gram matching
  candidate: {It is a, is a nice, a nice day, nice day today}
  reference: {today is a, is a nice, a nice day}
  Result: the match rate is 2/4
- An extreme example
  candidate: the the the the
  reference: The cat is standing on the ground
  With 1-gram matching, every "the" in the candidate also appears in the reference, so the match rate is 4/4 = 1, which is clearly unreasonable.
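The match rates in the example can be reproduced with a small sketch in plain Python. It also shows the remedy BLEU commonly uses for the extreme case: clipping each candidate n-gram's count to the number of times it occurs in the reference. The helper names `ngrams` and `ngram_precision` and the `clip` flag are assumptions made for this illustration, and tokens are lowercased so that "The" and "the" compare equal.

```python
from collections import Counter
from fractions import Fraction


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(candidate, reference, n, clip=True):
    """Fraction of candidate n-grams that match the reference.

    With clip=False, every candidate n-gram found in the reference counts.
    With clip=True, each n-gram counts at most as often as it occurs
    in the reference (clipped counts).
    """
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return Fraction(0)
    if clip:
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    else:
        matched = sum(count for gram, count in cand.items() if gram in ref)
    return Fraction(matched, total)


cand, ref = "It is a nice day today", "today is a nice day"
for n in (1, 2, 3):
    # Fractions are reduced automatically, so 2/4 is printed as 1/2.
    print(n, ngram_precision(cand, ref, n))  # 5/6, 3/5, 1/2

extreme_cand = "the the the the"
extreme_ref = "The cat is standing on the ground"
print(ngram_precision(extreme_cand, extreme_ref, 1, clip=False))  # 1
print(ngram_precision(extreme_cand, extreme_ref, 1, clip=True))   # 1/2
```

With clipping, "the" is credited only twice, because it occurs twice in the reference, so the 1-gram precision drops from 1 to 2/4.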

Python example code

from nltk.translate.bleu_score import sentence_bleu
candidate_texts = ["This", "is", "some", "generated", "text"]  # generated text
reference_texts = [["This",