Effect of different sacrebleu.corpus_bleu tokenizers on the BLEU score
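Basic usage of sacrebleu.corpus_bleu is covered in other blog posts. While scoring with BLEU, I found that for the same pair of sentences, the choice of tokenizer changes the result of the bleu function dramatically, so I am recording it here.
The code is below. The parameter settings are copied from the BLEU evaluation function in lm_eval (lm-eval-harness bleu).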
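import sacrebleu

# One reference stream containing one sentence, and one system output.
refs = [["最近3个月用户绑车的趋势图。"]]
sys = ["近3个月用户绑车的人数趋势。"]

bleu = sacrebleu.corpus_bleu(
    sys,
    refs,
    smooth_method="exp",
    smooth_value=0.0,
    force=False,
    lowercase=False,
    tokenize="flores101",  # swap in each tokenizer name to compare
    use_effective_order=False,
)
print(bleu)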
sacrebleu.corpus_bleu provides the following tokenizers. Let's try each of them in turn.
_TOKENIZERS = {
    'none': 'tokenizer_none.NoneTokenizer',
    'zh': 'tokenizer_zh.TokenizerZh',
    '13a': 'tokenizer_13a.Tokenizer13a',
    'intl': 'tokenizer_intl.TokenizerV14International',
    'char': 'tokenizer_char.TokenizerChar',
    'ja-mecab': 'tokenizer_ja_mecab.TokenizerJaMecab',
    'ko-mecab': 'tokenizer_ko_mecab.TokenizerKoMecab',
    'spm': 'tokenizer_spm.TokenizerSPM',
    'flores101': 'tokenizer_spm.Flores101Tokenizer',
    'flores200': 'tokenizer_spm.Flores200Tokenizer',
}
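To run the comparison in one go, the corpus_bleu call can be wrapped in a loop. This is only a minimal sketch, not the harness code: `bleu_by_tokenizer` is a hypothetical helper name, and the default list is limited to the tokenizers that, as far as I know, need no extra packages or model downloads (ja-mecab, ko-mecab, spm and the flores tokenizers need additional dependencies).

```python
def bleu_by_tokenizer(hyps, refs, tokenizers=("none", "13a", "intl", "zh", "char")):
    """Score the same hypotheses and references once per tokenizer name.

    Hypothetical helper for illustration; returns {tokenizer_name: BLEU result}.
    """
    # Imported here so the helper can be defined even where sacrebleu is missing.
    import sacrebleu

    results = {}
    for name in tokenizers:
        results[name] = sacrebleu.corpus_bleu(
            hyps,
            refs,
            smooth_method="exp",
            lowercase=False,
            tokenize=name,  # the only argument that changes between runs
            use_effective_order=False,
        )
    return results
```

Calling it with the two sentences above and printing each value should give one `BLEU = ...` line per tokenizer, in the same format as the results listed below.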
tokenizer==none
# Result
BLEU = 0.00 0.0/0.0/0.0/0.0 (BP = 1.000 ratio = 1.000 hyp_len = 1 ref_len = 1)
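The hyp_len = 1 here follows from the sentence containing no spaces. My understanding is that with no tokenization the text is simply split on whitespace, so the whole sentence becomes a single token, and since hypothesis and reference are not identical, not even one unigram matches. A stdlib-only illustration of the token counts (not sacrebleu's actual code):

```python
hyp = "近3个月用户绑车的人数趋势。"

# No tokenization: split on whitespace only. Chinese text has no spaces,
# so the sentence collapses into a single token.
whitespace_tokens = hyp.split()

# Character-level view for comparison: every character is its own token.
char_tokens = list(hyp)

print(len(whitespace_tokens), len(char_tokens))  # prints: 1 14
```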
tokenizer==zh
# Result
BLEU = 65.92 85.7/69.2/58.3/54.5 (BP = 1.000 ratio = 1.000 hyp_len = 14 ref_len = 14)
tokenizer==13a
# Result
BLEU = 0.00 0.0/0.0/0.0/0.0 (BP = 1.000 ratio = 1.000 hyp_len = 1 ref_len = 1)
tokenizer==intl
# Result
BLEU = 0.00 50.0/50.0/0.0/0.0 (BP = 1.000 ratio = 1.000 hyp_len = 2 ref_len = 2)
tokenizer==char
# Result
BLEU = 65.92 85.7/69.2/58.3/54.5 (BP = 1.000 ratio = 1.000 hyp_len = 14 ref_len = 14)
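The character-level numbers can be checked by hand. The sketch below is a simplified single-sentence BLEU written only with the standard library, assuming one reference, no smoothing, and that every n-gram order has at least one match (true for this pair). `char_bleu` and `ngram_counts` are names made up for this illustration, not sacrebleu functions.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def char_bleu(hyp, ref, max_order=4):
    """Character-level BLEU for one hypothesis and one reference, no smoothing."""
    hyp_tokens, ref_tokens = list(hyp), list(ref)
    precisions = []
    for n in range(1, max_order + 1):
        hyp_counts = ngram_counts(hyp_tokens, n)
        ref_counts = ngram_counts(ref_tokens, n)
        # Counter intersection clips each n-gram's matches to its reference count.
        matched = sum((hyp_counts & ref_counts).values())
        precisions.append(matched / sum(hyp_counts.values()))
    # Brevity penalty: only applies when the hypothesis is shorter than the reference.
    if len(hyp_tokens) >= len(ref_tokens):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref_tokens) / len(hyp_tokens))
    # Geometric mean of the n-gram precisions, scaled to 0-100.
    score = bp * math.exp(sum(math.log(p) for p in precisions) / max_order)
    return 100 * score, [100 * p for p in precisions]


score, precisions = char_bleu("近3个月用户绑车的人数趋势。", "最近3个月用户绑车的趋势图。")
print(f"{score:.2f}", "/".join(f"{p:.1f}" for p in precisions))
# prints: 65.92 85.7/69.2/58.3/54.5
```

The precisions are 12/14, 9/13, 7/12 and 6/11, which matches the line reported for char above.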
tokenizer==ja-mecab
# Result
BLEU = 48.55 77.8/50.0/42.9/33.3 (BP = 1.000 ratio = 1.000 hyp_len = 9 ref_len = 9)
tokenizer==ko-mecab
# Result
BLEU = 57.58 78.6/61.5/50.0/45.5 (BP = 1.000 ratio = 1.077 hyp_len = 14 ref_len = 13)
tokenizer==spm
# Result
BLEU = 29.07 60.0/33.3/25.0/14.3 (BP = 1.000 ratio = 1.111 hyp_len = 10 ref_len = 9)
tokenizer==flores101
# Result
BLEU = 29.07 60.0/33.3/25.0/14.3 (BP = 1.000 ratio = 1.111 hyp_len = 10 ref_len = 9)
tokenizer==flores200
# Result
BLEU = 48.33 75.0/54.5/40.0/33.3 (BP = 1.000 ratio = 1.000 hyp_len = 12 ref_len = 12)
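Summary
zh gives the best result here (65.92, the same score as char), while none and 13a leave the whole sentence as a single token and score 0. The harness, however, uses intl, probably because its evaluation datasets are in English. zh is the tokenizer better suited to Chinese text.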