OpenNMT Usage Notes

Article link: https://blog.csdn.net/ccbrid/article/details/103660320

Official site: https://opennmt.net

What it is: an open-source neural machine translation (NMT) toolkit.

OpenNMT is an open source ecosystem for neural machine translation and neural sequence learning.

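Origin: released by the Harvard NLP group in late 2016. The original main version is built on Torch and written in Lua; the commands in these notes use OpenNMT-py, the PyTorch version.

GitHub: https://github.com/OpenNMT/OpenNMT-py/blob/master/docs/source/Summarization.md

Note: adjust the command-line arguments below to your own data and model. For the arguments needed by other setups, such as a pointer network or a Transformer, see the page linked above.

Usage steps after a successful installation

Using a virtual environment is recommended: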

source activate onmt
cd pytorch/onmt/

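Data preprocessing

Input: the src and tgt files of the training and development sets, which are split into shards of shard_size examples each.

Output: train.*.pt and valid.*.pt, plus the vocabulary file vocab.pt.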

onmt_preprocess -train_src data/cnndm/train.txt.src \
                -train_tgt data/cnndm/train.txt.tgt.tagged \
                -valid_src data/cnndm/val.txt.src \
                -valid_tgt data/cnndm/val.txt.tgt.tagged \
                -save_data data/cnndm/CNNDM \
                -src_seq_length 10000 \
                -tgt_seq_length 10000 \
                -src_seq_length_trunc 400 \
                -tgt_seq_length_trunc 100 \
                -dynamic_dict \
                -share_vocab \
                -shard_size 100000
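
To make the length options and shard_size more concrete, here is a minimal Python sketch of the idea. It is not OpenNMT-py's code: the function names are hypothetical, and the order in which the sketch truncates and then filters is an assumption made only for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[List[str], List[str]]  # (source tokens, target tokens)


def truncate_and_filter(pairs: Iterable[Pair],
                        src_trunc: int = 400, tgt_trunc: int = 100,
                        src_max: int = 10000, tgt_max: int = 10000) -> Iterator[Pair]:
    """Cut each side to its *_trunc length, then drop empty or over-long pairs.

    Assumption: truncation is applied before the length filter.
    """
    for src, tgt in pairs:
        src, tgt = src[:src_trunc], tgt[:tgt_trunc]
        if 0 < len(src) <= src_max and 0 < len(tgt) <= tgt_max:
            yield src, tgt


def make_shards(pairs: Iterable[Pair], shard_size: int = 100000) -> Iterator[List[Pair]]:
    """Group examples into lists of at most shard_size examples."""
    shard: List[Pair] = []
    for pair in pairs:
        shard.append(pair)
        if len(shard) == shard_size:
            yield shard
            shard = []
    if shard:  # the last shard may be smaller
        yield shard
```

With the values used in the command above, a 500-token source is cut to 400 tokens, and every 100000 examples go into one train.*.pt file.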

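Pretrained embeddings (this step can be skipped)

Input: an embedding file trained on a large corpus (for example embeddings/glove/glove.6B.300d.txt) and the vocabulary file produced by the preprocessing step.

Output: embedding files restricted to the vocabulary of this corpus.

python embeddings_to_torch.py -emb_file_both embeddings/glove/glove.6B.300d.txt \
                              -dict_file data/XXX.vocab.pt \
                              -output_file data/XXX_embeddings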
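
Conceptually, this step looks up every vocabulary token in the pretrained file and builds one embedding row per token. The sketch below shows that idea in plain Python. It is not the code of embeddings_to_torch.py: the function names are hypothetical, and filling missing tokens with small random values is an assumption.

```python
import random
from typing import Dict, Iterable, List


def load_vectors(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse GloVe-style text lines: a token followed by space-separated floats."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(vocab: List[str], vectors: Dict[str, List[float]],
                           dim: int, seed: int = 0) -> List[List[float]]:
    """Return one row per vocabulary token, in vocabulary order."""
    rng = random.Random(seed)
    matrix = []
    for token in vocab:
        vec = vectors.get(token)
        if vec is None or len(vec) != dim:
            # Assumption: tokens without a pretrained vector get small random values.
            vec = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        matrix.append(list(vec))
    return matrix
```

For a real GloVe file, the lines would come from open("embeddings/glove/glove.6B.300d.txt", encoding="utf-8") and dim would be 300.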

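Training

On a multi-GPU machine, prefix the command with CUDA_VISIBLE_DEVICES (for example CUDA_VISIBLE_DEVICES=0) to choose which card(s) to train on. The command below trains on two GPUs (-world_size 2 -gpu_ranks 0 1), so two cards must be visible to it; for a single card, use -world_size 1 -gpu_ranks 0.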

onmt_train -save_model models/cnndm \
           -data data/cnndm/CNNDM \
           -copy_attn \
           -global_attention mlp \
           -word_vec_size 128 \
           -rnn_size 512 \
           -layers 1 \
           -encoder_type brnn \
           -train_steps 200000 \
           -max_grad_norm 2 \
           -dropout 0. \
           -batch_size 16 \
           -valid_batch_size 16 \
           -optim adagrad \
           -learning_rate 0.15 \
           -adagrad_accumulator_init 0.1 \
           -reuse_copy_attn \
           -copy_loss_by_seqlength \
           -bridge \
           -seed 777 \
           -world_size 2 \
           -gpu_ranks 0 1

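Testing

(Usually the model that performed best on the development set during training is chosen for testing.)

On my machine, training ran on the GPU, but testing ran out of memory on it.

If you hit the same problem, simply drop the -gpu argument so that translation runs on the CPU.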

onmt_translate -gpu X \
               -batch_size 20 \
               -beam_size 10 \
               -model models/cnndm... \
               -src data/cnndm/test.txt.src \
               -output testout/cnndm.out \
               -min_length 35 \
               -verbose \
               -stepwise_penalty \
               -coverage_penalty summary \
               -beta 5 \
               -length_penalty wu \
               -alpha 0.9 \
               -block_ngram_repeat 3 \
               -ignore_when_blocking "." "</t>" "<t>"

This step may end with the error "OverflowError: math range error".

The error can be ignored: by the time it appears, the output file testout/cnndm.out has already been written.
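
Python raises this error when math.exp is given a value too large for a float. One plausible source, stated here as an assumption rather than a confirmed diagnosis, is a final reporting step that exponentiates an average score, such as a perplexity. The sketch below uses a hypothetical helper, safe_perplexity, to show how such a computation could be guarded.

```python
import math


def safe_perplexity(total_score: float, total_words: int) -> float:
    """Return exp(-total_score / total_words), or infinity if that overflows.

    Assumption: total_score is a summed (negative) log-score and total_words > 0.
    """
    try:
        return math.exp(-total_score / total_words)
    except OverflowError:
        # math.exp raises "OverflowError: math range error" for very large inputs.
        return float("inf")
```

If this is the cause, it also explains why the translations are already on disk: reporting happens only after all outputs are written.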

Evaluating the results

Example:

python test_rouge.py -r data/test.tgt.new -c testout/cnndm.out
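
For readers unfamiliar with the metric, the sketch below computes a simplified ROUGE-1 score for one candidate/reference pair. It is only an illustration of unigram overlap, not what test_rouge.py does internally, and rouge_1 is a hypothetical name. Scores from a full ROUGE implementation (with stemming and other options) will differ.

```python
from collections import Counter
from typing import Tuple


def rouge_1(candidate: str, reference: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over whitespace-separated unigrams."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```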

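Parameter notes:

Preprocessing the data

--dynamic_dict: required when using copy attention. It preprocesses the dataset so that source and target are aligned for copying.

--share_vocab: makes source and target use the same vocabulary.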
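
The following sketch illustrates what "aligned for copying" can mean: each example gets its own small dictionary built from its source tokens, and each target token is mapped to its index in that dictionary so a copy mechanism knows which source word it could copy. This is an assumption-laden illustration, not OpenNMT-py's implementation; the function names and the use of index 0 for words absent from the source are hypothetical.

```python
from typing import Dict, List, Tuple


def build_dynamic_dict(src_tokens: List[str]) -> Dict[str, int]:
    """Map each distinct source token to an index local to this example."""
    local = {"<unk>": 0}  # assumption: index 0 means "not in the source"
    for tok in src_tokens:
        local.setdefault(tok, len(local))
    return local


def align_target(src_tokens: List[str],
                 tgt_tokens: List[str]) -> Tuple[List[int], List[int]]:
    """Return (source indices, target indices) in the example-local dictionary."""
    local = build_dynamic_dict(src_tokens)
    src_map = [local[t] for t in src_tokens]
    alignment = [local.get(t, 0) for t in tgt_tokens]
    return src_map, alignment
```

For the source "obama visited paris obama" and the target "obama in paris", the target word "in" maps to 0 because it cannot be copied from the source.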

Training (the parameter choices and implementation largely follow See et al.)

--copy_attn: [copy] This is the most important option, since it allows the model to copy words from the source.

--global_attention mlp: uses the attention mechanism of Bahdanau et al. [3] instead of that of Luong et al. [4] (global_attention dot).

--share_embeddings: shares the word embeddings between encoder and decoder, which considerably reduces the number of parameters the model has to learn. We did not find this option to be helpful, but you can try it out by adding it to the training command.

--reuse_copy_attn: reuses the standard attention as the copy attention. Without this, the model learns an additional attention that is only used for copying.

--copy_loss_by_seqlength: divides the loss by the sequence length. In practice, we found that this lets the model generate longer sequences during inference. However, this effect can also be achieved by using penalties during decoding.

--bridge: an additional layer that takes the final hidden state of the encoder as input and computes an initial hidden state for the decoder. Without this, the decoder is initialized with the final hidden state of the encoder directly.

--optim adagrad: Adagrad outperforms SGD when coupled with the following option.

--adagrad_accumulator_init 0.1: PyTorch's Adagrad does not give the accumulator a nonzero initial value. To match the optimization algorithm of the TensorFlow version, this option needs to be added.
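
To see why the initial accumulator value matters, here is a minimal scalar sketch of the textbook Adagrad update. It is not the optimizer code used by OpenNMT-py, and adagrad_step is a hypothetical name. With a zero accumulator, the first step has a size close to the learning rate regardless of how small the gradient is; starting from 0.1 damps the early steps.

```python
import math
from typing import List


def adagrad_step(params: List[float], grads: List[float], accum: List[float],
                 lr: float = 0.15, eps: float = 1e-10) -> None:
    """Apply one in-place Adagrad update: p -= lr * g / (sqrt(sum of g^2) + eps)."""
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)


# Same gradient, two different initial accumulator values.
p_zero, acc_zero = [1.0], [0.0]
p_init, acc_init = [1.0], [0.1]  # corresponds to -adagrad_accumulator_init 0.1
adagrad_step(p_zero, [0.01], acc_zero)
adagrad_step(p_init, [0.01], acc_init)
print(1.0 - p_zero[0], 1.0 - p_init[0])  # first-step sizes for the two settings
```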

Inference (beam search with a beam size of 10, plus the following penalties that can be applied during decoding)

--stepwise_penalty: applies the penalty at every step.

--coverage_penalty summary: [coverage] uses a penalty that prevents repeated attention to the same source word.

--beta 5: parameter for the coverage penalty.

--length_penalty wu: uses the length penalty of Wu et al.

--alpha 0.9: parameter for the length penalty.

--block_ngram_repeat 3: prevents the model from repeating trigrams.

--ignore_when_blocking "." "</t>" "<t>": allows the model to repeat trigrams that contain the sentence-boundary tokens.
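
The decoding options above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not OpenNMT-py's beam search code: the function names are hypothetical, the length penalty uses the formula from Wu et al., ((5 + length) / 6) ** alpha, and the "summary" coverage penalty is assumed to charge beta times the attention mass above 1.0 on each source position.

```python
from typing import FrozenSet, List, Sequence


def wu_length_penalty(length: int, alpha: float = 0.9) -> float:
    """Length penalty of Wu et al.; a hypothesis log-probability is divided by it."""
    return ((5.0 + length) / 6.0) ** alpha


def summary_coverage_penalty(coverage: Sequence[float], beta: float = 5.0) -> float:
    """Penalize source positions whose accumulated attention exceeds 1.0."""
    return beta * (sum(max(c, 1.0) for c in coverage) - len(coverage))


def repeats_ngram(tokens: List[str], n: int = 3,
                  ignore: FrozenSet[str] = frozenset({".", "</t>", "<t>"})) -> bool:
    """Return True if any n-gram occurs twice, skipping n-grams with ignored tokens."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if ignore.intersection(gram):
            continue  # n-grams touching sentence-boundary tokens may repeat
        if gram in seen:
            return True
        seen.add(gram)
    return False
```

A beam search would discard (or heavily penalize) any hypothesis for which repeats_ngram returns True, and rank the rest by their penalized scores.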

Example commands: http://opennmt.net/OpenNMT-py/Summarization.html

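Notes on data preprocessing

We also turn off truncation of the source (-src_seq_length 10000 in the preprocessing command; the default limit is 50) to ensure that inputs longer than 50 words are not truncated.

For CNN-DM, we follow See et al. [2] and additionally truncate the source length to 400 tokens and the target length to 100 tokens.

We also found that on CNN-DM the model works better if the target sentences are surrounded with tags, so that a sentence looks like <t> w1 w2 w3 . </t>. If you use this format, you can remove the tags after the inference step with the commands sed -i 's/ <\/t>//g' FILE.txt and sed -i 's/<t> //g' FILE.txt.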
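
If sed is not available, the same two substitutions can be done in Python. This is a small sketch; strip_sentence_tags is a hypothetical helper name, and the file handling around it is left to the reader.

```python
def strip_sentence_tags(line: str) -> str:
    """Remove every ' </t>' and '<t> ' from a line, like the two sed commands."""
    return line.replace(" </t>", "").replace("<t> ", "")


example = "<t> w1 w2 w3 . </t> <t> w4 w5 . </t>"
print(strip_sentence_tags(example))  # w1 w2 w3 . w4 w5 .
```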

To be continued.