BERTScore
This comes from an ICLR 2020 paper by Cornell University: BERTSCORE: EVALUATING TEXT GENERATION WITH BERT.
The method builds on a pretrained BERT model: each sentence is represented by the contextual embeddings of its tokens, and a candidate is scored against a reference via cosine similarity between their token embeddings.
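Concretely, with L2-normalized contextual embeddings for the reference tokens $x = \langle x_1, \dots \rangle$ and candidate tokens $\hat{x} = \langle \hat{x}_1, \dots \rangle$, the paper defines recall, precision, and F1 by greedily matching each token to its most similar counterpart:

$$
R_{\text{BERT}} = \frac{1}{|x|}\sum_{x_i \in x}\max_{\hat{x}_j \in \hat{x}} x_i^\top \hat{x}_j,\qquad
P_{\text{BERT}} = \frac{1}{|\hat{x}|}\sum_{\hat{x}_j \in \hat{x}}\max_{x_i \in x} x_i^\top \hat{x}_j,\qquad
F_{\text{BERT}} = 2\,\frac{P_{\text{BERT}}\cdot R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}
$$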
Common weaknesses of n-gram matching metrics:
- Semantically-correct phrases are penalized because they differ from the surface form of the reference.
Fix: in contrast to string matching (e.g., in BLEU) or matching heuristics (e.g., in METEOR), BERTScore computes similarity using contextualized token embeddings, which have been shown to be effective for paraphrase detection.
- N-gram models fail to capture distant dependencies and penalize semantically-critical ordering changes.
Fix: contextualized embeddings are trained to effectively capture distant dependencies and ordering.
Experimental results:
(1) In machine translation, BERTSCORE shows stronger system-level and segment-level correlations with human judgments than existing metrics on multiple common benchmarks.
(2) BERTSCORE is well-correlated with human annotators for image captioning, surpassing SPICE.
Implementation: download a pretrained model, run it in eval mode on a batch of sentences, and take the contextual embedding of every token (the score is computed from per-token embeddings rather than the [CLS] vector). Build the pairwise cosine-similarity matrix between the candidate and reference tokens along with idf weights, greedily take the maximum similarity for each token, multiply it by that token's idf, and aggregate into the BERTScore.
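The greedy matching step is easy to see in code. Below is a minimal numpy sketch where random toy embeddings stand in for real BERT outputs (the shapes and idf values are illustrative assumptions, not the bert_score package's internals):

import numpy as np

# Toy contextual embeddings: 3 candidate tokens and 4 reference tokens, dim 8.
# (Random stand-ins for real BERT outputs.)
rng = np.random.default_rng(0)
cand = rng.standard_normal((3, 8))
ref = rng.standard_normal((4, 8))
idf_cand = np.array([0.2, 0.5, 0.3])      # one idf weight per candidate token
idf_ref = np.array([0.1, 0.4, 0.3, 0.2])  # one idf weight per reference token

# L2-normalize so dot products equal cosine similarities.
cand /= np.linalg.norm(cand, axis=1, keepdims=True)
ref /= np.linalg.norm(ref, axis=1, keepdims=True)

sim = cand @ ref.T  # pairwise cosine-similarity matrix (3 x 4)

# Greedy matching: each reference token takes its best candidate match (recall),
# each candidate token takes its best reference match (precision), idf-weighted.
R = (sim.max(axis=0) * idf_ref).sum() / idf_ref.sum()
P = (sim.max(axis=1) * idf_cand).sum() / idf_cand.sum()
F = 2 * P * R / (P + R)
print(f"P_BERT={P:.3f}  R_BERT={R:.3f}  F_BERT={F:.3f}")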
The paper also applies plus-one smoothing (when computing the idf weights) and baseline rescaling to adjust BERTScore.
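The rescaling maps the raw score linearly against an empirical lower bound $b$ (estimated by scoring random sentence pairs); it does not change rankings but makes the score range more readable:

$$\text{BERTSCORE}_{\text{rescaled}} = \frac{\text{BERTSCORE} - b}{1 - b}$$

In the bert_score package this is exposed through the rescale_with_baseline=True argument of score().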
Conclusion: F_BERT is the metric commonly used for machine translation; for evaluating English generation, the 24-layer RoBERTa-large model is recommended; for multilingual tasks, multilingual BERT is recommended; for Chinese the package automatically selects "bert-base-chinese".
Code example: it is very simple!
!pip install bert_score
from bert_score import score
# data
cands = ['我们都曾经年轻过']
refs = ['虽然我们都年少,但还是懂事的']

# lang="zh" automatically selects bert-base-chinese
P, R, F1 = score(cands, refs, lang="zh", verbose=True)
print(f"P={P.item():.4f}, R={R.item():.4f}, F1={F1.item():.4f}")
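score() returns precision, recall, and F1 tensors with one entry per candidate-reference pair; F_BERT (F1) is the number usually reported. For English generation, per the recommendation above, one would pass lang="en" instead (a hedged sketch; recent versions of the package default English to roberta-large):

P, R, F1 = score(['the cat sat on the mat'], ['a cat was sitting on the mat'], lang="en")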