Learning OpenNMT-py (the PyTorch version of OpenNMT)

爱上代码的虫

Tags: pytorch, deep learning, nlp

Article link: https://blog.csdn.net/qq_43370683/article/details/130058965

Adapted from the Quickstart page of the official OpenNMT-py documentation.

Quickstart

0. Install OpenNMT-py

pip install --upgrade pip
pip install OpenNMT-py

More detailed instructions are available at https://github.com/ymoslem/OpenNMT-Tutorial

1. Prepare the data

To get started, download a toy English-German dataset for machine translation. It contains 10k tokenized sentences:

wget https://s3.amazonaws.com/opennmt-trainingdata/toy-ende.tar.gz
tar xf toy-ende.tar.gz

The data consists of parallel source (src) and target (tgt) files. Each line holds one segment whose tokens are separated by spaces.

src-train.txt
tgt-train.txt
src-val.txt
tgt-val.txt

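The validation files are used to evaluate the convergence of training. They usually contain no more than 5k sentences.

Inspect the dataset: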

$ head -n 2 toy-ende/src-train.txt
It is not acceptable that , with the help of the national bureaucracies , Parliament &apos;s legislative prerogative should be made null and void by means of implementing provisions whose content , purpose and extent are not laid down in advance .
Federal Master Trainer and Senior Instructor of the Italian Federation of Aerobic Fitness , Group Fitness , Postural Gym , Stretching and Pilates; from 2004 , he has been collaborating with Antiche Terme as personal Trainer and Instructor of Stretching , Pilates and Postural Gym .
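Because training pairs line N of the source file with line N of the target file, it can help to confirm that the two files line up before going further. The following is a minimal sketch of such a check; the helper names `read_tokens` and `read_parallel` are hypothetical and are not part of OpenNMT-py.

```python
def read_tokens(path):
    """Read a tokenized text file and return one list of tokens per line."""
    with open(path, encoding="utf-8") as f:
        # The files are already tokenized, so splitting on whitespace is enough.
        return [line.split() for line in f]


def read_parallel(src_path, tgt_path):
    """Return (src_tokens, tgt_tokens) pairs, failing if the files differ in length."""
    src = read_tokens(src_path)
    tgt = read_tokens(tgt_path)
    if len(src) != len(tgt):
        raise ValueError(
            f"line count mismatch: {len(src)} source vs {len(tgt)} target lines"
        )
    return list(zip(src, tgt))
```

For the toy dataset, calling `read_parallel("toy-ende/src-train.txt", "toy-ende/tgt-train.txt")` from the directory where the archive was extracted should return one pair per training sentence.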

Write a YAML configuration file that specifies the data to use.

# toy_en_de.yaml

## Where the samples will be written
save_data: toy-ende/run/example
## Where the vocab(s) will be written
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt
# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    corpus_1:
        path_src: toy-ende/src-train.txt
        path_tgt: toy-ende/tgt-train.txt
    valid:
        path_src: toy-ende/src-val.txt
        path_tgt: toy-ende/tgt-val.txt
...

With this configuration, we can build the vocabulary needed for training.

onmt_build_vocab -config toy_en_de.yaml -n_sample 10000

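Notes:

-n_sample is required here. It is the number of lines sampled from each corpus to build the vocabulary.
This is the simplest possible configuration, with no tokenization or other transforms. See the other example configurations for more complex pipelines.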
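To make the role of -n_sample concrete, here is a rough sketch of what building a vocabulary from a sample of lines amounts to: counting how often each whitespace-separated token occurs. This is only an illustration, not the implementation behind onmt_build_vocab. The function `count_tokens` is hypothetical, and it simply takes the first `n_sample` lines, which is an assumption about the sampling strategy.

```python
from collections import Counter
from itertools import islice


def count_tokens(lines, n_sample):
    """Count whitespace-separated tokens in the first n_sample lines."""
    counter = Counter()
    for line in islice(lines, n_sample):
        counter.update(line.split())
    return counter


# Three made-up sentences; only the first two are sampled.
sample = [
    "the cat sat",
    "the dog sat",
    "a bird flew",
]
counts = count_tokens(sample, n_sample=2)
for token, freq in counts.most_common():
    print(f"{token}\t{freq}")
```

Tokens that appear only in lines beyond the sample (here "a", "bird" and "flew") never enter the count, which is why the sample size should cover the corpus when the dataset is small, as with the 10000 lines used above.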

2. Train the model

To train a model, add the following to the YAML configuration file:

the vocabulary files to use
the training parameters

# toy_en_de.yaml

...

# Vocabulary files that were just created
src_vocab: toy-ende/run/example.vocab.src
tgt_vocab: toy-ende/run/example.vocab.tgt

# Train on a single GPU
world_size: 1
gpu_ranks: [0]

# Where to save the checkpoints
save_model: toy-ende/run/model
save_checkpoint_steps: 500
train_steps: 1000
valid_steps: 500

Train the model with:

onmt_train -config toy_en_de.yaml

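This runs the default model, in which both the encoder and the decoder are 2-layer LSTMs with 500 hidden units. With world_size: 1 and gpu_ranks: [0] set as above, training runs on a single GPU (the first one).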
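The interaction between train_steps and save_checkpoint_steps determines which checkpoint files to expect after training. The sketch below lists them, assuming the file name pattern `<save_model>_step_<N>.pt`, which is inferred from the model_step_1000.pt file used in the translation step. The function `expected_checkpoints` is hypothetical and only does the arithmetic; it does not inspect the disk or call OpenNMT-py.

```python
def expected_checkpoints(save_model, train_steps, save_checkpoint_steps):
    """List checkpoint paths written every save_checkpoint_steps up to train_steps."""
    steps = range(save_checkpoint_steps, train_steps + 1, save_checkpoint_steps)
    # Assumed naming pattern: <save_model>_step_<N>.pt
    return [f"{save_model}_step_{step}.pt" for step in steps]


# Values taken from the configuration above.
for path in expected_checkpoints("toy-ende/run/model", 1000, 500):
    print(path)
```

With the settings above this yields two paths, one for step 500 and one for step 1000. Validation follows the same schedule because valid_steps is also 500.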

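Before training actually starts, the *.vocab.pt and *.transforms.pt files can be dumped to the -save_data location, using the configuration given in the -config YAML file, by enabling the -dump_fields and -dump_transforms flags. Transformed samples can also be generated to make visual inspection easier. The number of sample lines to dump per corpus is set with the -n_sample flag.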

For more recommended models and parameters, see the other example configurations or the FAQ.

3. Translate

onmt_translate -model toy-ende/run/model_step_1000.pt -src toy-ende/src-test.txt -output toy-ende/pred_1000.txt -gpu 0 -verbose

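You now have a model that can be used to predict on new data. Prediction is done by running beam search, and the command above writes the predictions to toy-ende/pred_1000.txt.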
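For readers unfamiliar with beam search, the following toy sketch shows the idea: at every step, keep only the `beam_size` best partial hypotheses instead of committing to the single best token. It is not the OpenNMT-py implementation (it has no length normalization or batching, for example), and `beam_search`, `toy_model` and the probability table are all made up for illustration.

```python
import math


def beam_search(next_log_probs, bos, eos, beam_size=2, max_len=10):
    """Return the highest-scoring token sequence found by beam search.

    next_log_probs(prefix) must return a dict {token: log_prob} for the
    next token, given the tuple of tokens generated so far.
    """
    beams = [((bos,), 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, log_prob in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + log_prob))
        if not candidates:
            break
        # Keep only the beam_size best hypotheses.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    # Fall back to unfinished hypotheses if nothing reached eos.
    best_prefix, _ = max(finished or beams, key=lambda item: item[1])
    return list(best_prefix)


# A made-up next-token distribution; unknown prefixes always end the sentence.
TABLE = {
    ("<s>",): {"a": math.log(0.6), "b": math.log(0.4)},
    ("<s>", "a"): {"x": math.log(0.55), "y": math.log(0.45)},
    ("<s>", "b"): {"z": math.log(0.9), "</s>": math.log(0.1)},
}


def toy_model(prefix):
    return TABLE.get(prefix, {"</s>": 0.0})


greedy = beam_search(toy_model, "<s>", "</s>", beam_size=1)
wider = beam_search(toy_model, "<s>", "</s>", beam_size=2)
```

In this table the greedy choice (`beam_size=1`) starts with "a" because it is the most likely first token, while a beam of 2 also keeps "b" alive and ends up with the sequence through "z", whose overall probability (0.4 × 0.9) is higher than the best sequence through "a" (0.6 × 0.55).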

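Note:
Because the demo dataset is small, the predictions will be quite poor. Try running on some larger datasets! For example, you can download millions of parallel sentences for translation or summarization.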
