BERTScore
This comes from an ICLR 2020 paper by Cornell University: BERTSCORE: EVALUATING TEXT GENERATION WITH BERT.
The method builds on a pretrained BERT model: each sentence is represented by the contextual embeddings of its tokens, and a candidate is scored against a reference via cosine similarity between their token embeddings.
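Concretely, with L2-normalized contextual embeddings for the reference tokens $x = \langle x_1, \dots \rangle$ and candidate tokens $\hat{x} = \langle \hat{x}_1, \dots \rangle$, the paper defines recall, precision, and F1 by greedily matching each token to its most similar counterpart:

$$
R_{\text{BERT}} = \frac{1}{|x|}\sum_{x_i \in x}\max_{\hat{x}_j \in \hat{x}} x_i^\top \hat{x}_j,\qquad
P_{\text{BERT}} = \frac{1}{|\hat{x}|}\sum_{\hat{x}_j \in \hat{x}}\max_{x_i \in x} x_i^\top \hat{x}_j,\qquad
F_{\text{BERT}} = 2\,\frac{P_{\text{BERT}}\cdot R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}
$$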
Common weaknesses of n-gram matching metrics:
- Semantically-correct phrases are penalized because they differ from the surface form of the reference.
Fix: in contrast to string matching (e.g., in BLEU) or matching heuristics (e.g., in METEOR), BERTScore computes similarity using contextualized token embeddings, which have been shown to be effective for paraphrase detection.
- N-gram models fail to capture distant dependencies and penalize semantically-critical ordering changes.
Fix: contextualized embeddings are trained to effectively capture distant dependencies and ordering.
Experimental results:
(1) In machine translation, BERTSCORE shows stronger system-level and segment-level correlations with human judgments than existing metrics on multiple common benchmarks.
(2) BERTSCORE is well-correlated with human annotators for image captioning, surpassing SPICE.
Implementation: download a pretrained model, run it in eval mode on a batch of sentences, and take the contextual embedding of every token (the score is computed from per-token embeddings rather than the [CLS] vector). Build the pairwise cosine-similarity matrix between the candidate and reference tokens along with idf weights, greedily take the maximum similarity for each token, multiply it by that token's idf, and aggregate into the BERTScore.
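The greedy matching step is easy to see in code. Below is a minimal numpy sketch where random toy embeddings stand in for real BERT outputs (the shapes and idf values are illustrative assumptions, not the bert_score package's internals):

import numpy as np

# Toy contextual embeddings: 3 candidate tokens and 4 reference tokens, dim 8.
# (Random stand-ins for real BERT outputs.)
rng = np.random.default_rng(0)
cand = rng.standard_normal((3, 8))
ref = rng.standard_normal((4, 8))
idf_cand = np.array([0.2, 0.5, 0.3])      # one idf weight per candidate token
idf_ref = np.array([0.1, 0.4, 0.3, 0.2])  # one idf weight per reference token

# L2-normalize so dot products equal cosine similarities.
cand /= np.linalg.norm(cand, axis=1, keepdims=True)
ref /= np.linalg.norm(ref, axis=1, keepdims=True)

sim = cand @ ref.T  # pairwise cosine-similarity matrix (3 x 4)

# Greedy matching: each reference token takes its best candidate match (recall),
# each candidate token takes its best reference match (precision), idf-weighted.
R = (sim.max(axis=0) * idf_ref).sum() / idf_ref.sum()
P = (sim.max(axis=1) * idf_cand).sum() / idf_cand.sum()
F = 2 * P * R / (P + R)
print(f"P_BERT={P:.3f}  R_BERT={R:.3f}  F_BERT={F:.3f}")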
The paper also applies plus-one smoothing (when computing the idf weights) and baseline rescaling to adjust BERTScore.
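The rescaling maps the raw score linearly against an empirical lower bound $b$ (estimated by scoring random sentence pairs); it does not change rankings but makes the score range more readable:

$$\text{BERTSCORE}_{\text{rescaled}} = \frac{\text{BERTSCORE} - b}{1 - b}$$

In the bert_score package this is exposed through the rescale_with_baseline=True argument of score().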
Conclusion: F_BERT is the metric commonly used for machine translation; for evaluating English generation, the 24-layer RoBERTa-large model is recommended; for multilingual tasks, multilingual BERT is recommended; for Chinese the package automatically selects "bert-base-chinese".
Code example: it is very simple!
!pip install bert_score
from bert_score import score
# data
cands = ['我们都曾经年轻过']
refs = ['虽然我们都年少,但还是懂事的']

# lang="zh" automatically selects bert-base-chinese
P, R, F1 = score(cands, refs, lang="zh", verbose=True)
print(f"P={P.item():.4f}, R={R.item():.4f}, F1={F1.item():.4f}")
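score() returns precision, recall, and F1 tensors with one entry per candidate-reference pair; F_BERT (F1) is the number usually reported. For English generation, per the recommendation above, one would pass lang="en" instead (a hedged sketch; recent versions of the package default English to roberta-large):

P, R, F1 = score(['the cat sat on the mat'], ['a cat was sitting on the mat'], lang="en")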