Unsupervised Quality Estimation for Neural Machine Translation

1) Set up: define the variables

SRC_LANG=en
TGT_LANG=de
INPUT=/home/huyu1/tools/mlqe/data/en-de/train
OUTPUT_DIR=/home/huyu1/datas/en-de/output-en-de-train
TMP=/home/huyu1/datas/en-de/tmp-en-de-train
BPE=/home/huyu1/tools/en-de/bpecodes
MOSES_DECODER=/home/huyu1/tools/mosesdecoder
BPE_ROOT=/home/huyu1/tools/subword-nmt/subword_nmt
MODEL_DIR=/home/huyu1/tools/en-de
GPU=0
DROPOUT_N=10
SCRIPTS=/home/huyu1/fairseq/fairseq/examples/unsupervised_quality_estimation
METEOR=/home/huyu1/tools/meteor-1.5/meteor-1.5.jar
  • BPE (defined directly on the command line above): path to the BPE model (the learned BPE codes)

2) Translate the data using standard decoding (the model is already pre-trained; standard beam-search decoding, no dropout)

Preprocess the input data:
for LANG in $SRC_LANG $TGT_LANG; do
  perl $MOSES_DECODER/scripts/tokenizer/tokenizer.perl -threads 80 -a -l $LANG < $INPUT.$LANG > $TMP/preprocessed.tok.$LANG
  python $BPE_ROOT/apply_bpe.py -c ${BPE} < $TMP/preprocessed.tok.$LANG > $TMP/preprocessed.tok.bpe.$LANG
done
Running the commands individually for a single language (here English, learning the BPE codes from scratch):
  perl $MOSES_DECODER/scripts/tokenizer/tokenizer.perl -threads 80 -a -l en < $INPUT.en > $TMP/preprocessed.tok.en
  python $BPE_ROOT/learn_bpe.py -s 10000 <$TMP/preprocessed.tok.en >$TMP/codes.en
  python $BPE_ROOT/apply_bpe.py -c $TMP/codes.en < $TMP/preprocessed.tok.en > $TMP/preprocessed.tok.bpe.en
  • The tokenizer splits each full sentence into tokens.
  • BPE breaks the original words into smaller, high-frequency subword units for translation, which effectively handles out-of-vocabulary words (in the processed corpus, @-@ replaces the original -):
    First, learn the BPE model:
    learn_bpe.py learns BPE codes from $TMP/preprocessed.tok.en and writes them to the codes file ($TMP/codes.en); 10000 is the number of BPE codes (i.e. the 10000 most frequent merge operations are kept).
    Then apply the generated codes file to the corpus to segment it into subwords (a quick example follows below).
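
As a quick sanity check (a minimal sketch; the exact segmentation depends on the learned codes, and the example output is made up), a single sentence can be piped through apply_bpe.py and inspected for the @@ continuation markers:
  echo "unsupervised quality estimation" | python $BPE_ROOT/apply_bpe.py -c ${BPE}
  # possible output (depends on the codes): un@@ super@@ vised quality estim@@ ation
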
Binarize the data for faster translation:
fairseq-preprocess --srcdict $MODEL_DIR/dict.$SRC_LANG.txt --tgtdict $MODEL_DIR/dict.$TGT_LANG.txt --source-lang ${SRC_LANG} --target-lang ${TGT_LANG} --testpref $TMP/preprocessed.tok.bpe --destdir $TMP/bin --workers 4

fairseq-preprocess: data preprocessing: builds the vocabularies and binarizes the data
--srcdict: reuse the given source dictionary
--tgtdict: reuse the given target dictionary
--testpref: comma-separated test file prefixes (words missing from the training vocabulary are replaced with <unk>)
--destdir: destination directory. Default: "data-bin"
--workers: number of parallel workers. Default: 1
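
After this step, $TMP/bin should contain the binarized test split plus copies of the dictionaries, roughly as follows (file names follow fairseq's test.<src>-<tgt>.<lang> convention; the exact listing may differ):
  ls $TMP/bin
  # dict.en.txt  dict.de.txt  preprocess.log
  # test.en-de.en.bin  test.en-de.en.idx
  # test.en-de.de.bin  test.en-de.de.idx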

Translate
CUDA_VISIBLE_DEVICES=$GPU fairseq-generate $TMP/bin --path ${MODEL_DIR}/${SRC_LANG}-${TGT_LANG}.pt --beam 5 --source-lang $SRC_LANG --target-lang $TGT_LANG --no-progress-bar --unkpen 5 > $TMP/fairseq.out
grep ^H $TMP/fairseq.out | cut -d- -f2- | sort -n | cut -f3- > $TMP/mt.out

fairseq-generate: translate the preprocessed data with the trained model
--path: path to the model checkpoint
--beam: beam size. Default: 5
--no-progress-bar: disable the progress bar. Default: False
--unkpen: unknown word penalty: <0 produces more unks, >0 produces fewer. Default: 0
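
The grep/cut pipeline above relies on fairseq-generate's plain-text output format: each hypothesis line starts with H-<sentence id>, followed by the model score and the BPE-segmented tokens, separated by tabs. A sketch with made-up content:
  grep ^H $TMP/fairseq.out | head -2
  # H-0    -0.4213    ein Beispiel@@ satz .
  # H-1    -0.3871    noch ein Beispiel .
cut -d- -f2- strips the leading "H-", sort -n restores the original sentence order, and cut -f3- keeps only the hypothesis tokens (the scoring step in section 3 uses cut -f2 to keep the score field instead).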

Post-process
sed -r 's/(@@ )| (@@ ?$)//g' < $TMP/mt.out | perl $MOSES_DECODER/scripts/tokenizer/detokenizer.perl -l $TGT_LANG > $OUTPUT_DIR/mt.out

The data fed to the model is BPE-encoded, so the generated output is also BPE-segmented. The result therefore has to be restored to its pre-BPE state (i.e. the state after normalization, tokenization, cleaning and truecasing) by removing the @@ markers, and then detokenized with the Moses detokenizer, as sketched below.
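
A minimal illustration of what the sed expression does (the German fragment is made up):
  echo "ein Beispiel@@ satz ." | sed -r 's/(@@ )| (@@ ?$)//g'
  # ein Beispielsatz .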

3) Produce uncertainty estimates

Scoring
Make temporary files to store the translations repeated N times.
python ${SCRIPTS}/repeat_lines.py -i $TMP/preprocessed.tok.bpe.$SRC_LANG -n $DROPOUT_N -o $TMP/repeated.$SRC_LANG
python ${SCRIPTS}/repeat_lines.py -i $TMP/mt.out -n $DROPOUT_N -o $TMP/repeated.$TGT_LANG
fairseq-preprocess --srcdict ${MODEL_DIR}/dict.${SRC_LANG}.txt --tgtdict $MODEL_DIR/dict.$TGT_LANG.txt --source-lang ${SRC_LANG} --target-lang ${TGT_LANG} --testpref ${TMP}/repeated --destdir ${TMP}/bin-repeated

Repeat the preprocessed source data and the translations DROPOUT_N times each (see the sketch below),
then repeat the earlier binarization step on the repeated files.
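
Assuming (as in the fairseq example scripts) that repeat_lines.py simply writes each input line DROPOUT_N times in a row, its effect can be sketched with plain awk:
  # hypothetical equivalent of repeat_lines.py: duplicate every line n times consecutively
  awk -v n=$DROPOUT_N '{for (i = 0; i < n; i++) print}' $TMP/mt.out > $TMP/repeated.$TGT_LANG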

Produce model scores for the generated translations using the --retain-dropout option to apply dropout at inference time:
CUDA_VISIBLE_DEVICES=${GPU} fairseq-generate ${TMP}/bin-repeated --path ${MODEL_DIR}/${SRC_LANG}-${TGT_LANG}.pt --beam 5 --source-lang $SRC_LANG --target-lang $TGT_LANG --no-progress-bar --unkpen 5 --score-reference --retain-dropout --retain-dropout-modules '["TransformerModel","TransformerEncoder","TransformerDecoder","TransformerEncoderLayer","TransformerDecoderLayer"]'  --seed 46 > $TMP/dropout.scoring.out
grep ^H $TMP/dropout.scoring.out | cut -d- -f2- | sort -n | cut -f2 > $TMP/dropout.scores

--score-reference: just score the reference translation. Default: False
--retain-dropout: use dropout at inference time. Default: False
--retain-dropout-modules: if set, only retain dropout for the specified modules; if not set, dropout is retained for all modules

python $SCRIPTS/aggregate_scores.py -i $TMP/dropout.scores -o $OUTPUT_DIR/dropout.scores.mean -n $DROPOUT_N
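
aggregate_scores.py collapses the N dropout scores obtained for each source sentence into a single uncertainty estimate. Assuming the scores for one sentence occupy N consecutive lines and that the aggregation is the mean (as the output file name suggests), the reduction can be sketched with awk:
  # hypothetical equivalent: average every block of n consecutive scores
  awk -v n=$DROPOUT_N '{ s += $1; if (NR % n == 0) { print s / n; s = 0 } }' $TMP/dropout.scores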

Generation

Produce multiple translation hypotheses for the same source using the --retain-dropout option:
CUDA_VISIBLE_DEVICES=${GPU} fairseq-generate ${TMP}/bin-repeated --path ${MODEL_DIR}/${SRC_LANG}-${TGT_LANG}.pt --beam 5 --source-lang $SRC_LANG --target-lang $TGT_LANG --no-progress-bar --retain-dropout --unkpen 5 --retain-dropout-modules '["TransformerModel","TransformerEncoder","TransformerDecoder","TransformerEncoderLayer","TransformerDecoderLayer"]' --seed 46 > $TMP/dropout.generation.out
grep ^H $TMP/dropout.generation.out | cut -d- -f2- | sort -n | cut -f3- > $TMP/dropout.hypotheses_
sed -r 's/(@@ )| (@@ ?$)//g' < $TMP/dropout.hypotheses_ | perl $MOSES_DECODER/scripts/tokenizer/detokenizer.perl -l $TGT_LANG > $TMP/dropout.hypotheses
python ${SCRIPTS}/meteor.py -i $TMP/dropout.hypotheses -m $METEOR -n $DROPOUT_N -o $OUTPUT_DIR/dropout.gen.sim.meteor