Getting started with NMT: Nematus (TensorFlow-based version)

 

1. First, the README on GitHub:

 

NEMATUS

Attention-based encoder-decoder model for neural machine translation

This package is based on the dl4mt-tutorial by Kyunghyun Cho et al. ( https://github.com/nyu-dl/dl4mt-tutorial ). It was used to produce top-scoring systems at the WMT 16 shared translation task.

The changes to Nematus include:

  • the model has been re-implemented in tensorflow. See https://github.com/EdinburghNLP/nematus/tree/theano for the Theano-based version of Nematus.

  • new architecture variants for better performance

  • improvements to scoring and decoding:

    • n-best output for decoder // i.e., the n highest-scoring translations can be output
    • scripts for scoring (given parallel corpus) and rescoring (of n-best output)
  • usability improvements:

    • command line interface for training
    • vocabulary files and model parameters are stored in JSON format (backward-compatible loading)
    • server mode

See the changelog for more info.

SUPPORT

For general support requests, there is a Google Groups mailing list at https://groups.google.com/d/forum/nematus-support . You can also send an e-mail to nematus-support@googlegroups.com .

INSTALLATION

Nematus requires the following packages:

  • Python >= 2.7
  • tensorflow

To install tensorflow, we recommend following the steps at: ( https://www.tensorflow.org/install/ )

The following packages are optional, but highly recommended:

  • CUDA >= 7 (only GPU training is sufficiently fast)
  • cuDNN >= 4 (speeds up training substantially)

DOCKER USAGE // Docker packages an application together with its environment so it can be run as an isolated container

Introductions to Docker (in Chinese): https://baike.baidu.com/item/Docker/13344470?fr=aladdin

http://www.runoob.com/docker/docker-tutorial.html

You can also create a Docker image by running the following command, where you replace suffix with either cpu or gpu:

docker build -t nematus-docker -f Dockerfile.suffix .

To run a CPU docker instance with the current working directory shared with the Docker container, execute:

docker run -v `pwd`:/playground -it nematus-docker

For GPU you need to have nvidia-docker installed and run:

nvidia-docker run -v `pwd`:/playground -it nematus-docker

TRAINING SPEED

Training speed depends heavily on having appropriate hardware (ideally a recent NVIDIA GPU), and having installed the appropriate software packages.

To test your setup, we provide some speed benchmarks with 'test/test_train.sh', on an Intel Xeon CPU E5-2620 v4, with a Nvidia GeForce GTX Titan X (Pascal) and CUDA 9.0:

GPU, CuDNN 5.1, tensorflow 1.0.1:

CUDA_VISIBLE_DEVICES=0 ./test_train.sh

225.25 sentences/s

USAGE INSTRUCTIONS

All of the scripts below can be run with the --help flag to get usage information.

Sample commands with toy examples are available in the test directory; for training a full-scale system, consider the training scripts at http://data.statmt.org/wmt17_systems/training/

// For training a full-scale system, see the WMT17 training scripts (covered in section 2 below).
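Each of the scripts described below prints its full option list with --help; for example (run from the Nematus repository root; whether you invoke python or python2 depends on your installation):

python nematus/nmt.py --help
python nematus/translate.py --help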

Next, the individual scripts and their parameters:

nematus/nmt.py : use to train a new model

data sets; model loading and saving

parameter | description
--source_dataset PATH | parallel training corpus (source side)
--target_dataset PATH | parallel training corpus (target side)
--dictionaries PATH [PATH ...] | network vocabularies (one per source factor, plus target vocabulary)
--model PATH | model file name (default: model.npz)
--saveFreq INT | save frequency (default: 30000)
--reload | load existing model (if '--model' points to existing model)
--no_reload_training_progress | don't reload training progress (only used if --reload is enabled)
--summary_dir | directory for saving summaries (default: same directory as the --saveto file)
--summaryFreq | save summaries after INT updates, if 0 do not save summaries (default: 0)

network parameters

parameter | description
--embedding_size INT | embedding layer size (default: 512)
--state_size INT | hidden layer size (default: 1000)
--source_vocab_sizes INT | source vocabulary sizes (one per input factor) (default: None)
--target_vocab_size INT | target vocabulary size (default: None)
--factors INT | number of input factors (default: 1)
--dim_per_factor INT [INT ...] | list of word vector dimensionalities (one per factor): '--dim_per_factor 250 200 50' for total dimensionality of 500 (default: None)
--use_dropout | use dropout layer (default: False)
--dropout_embedding FLOAT | dropout for input embeddings (0: no dropout) (default: 0.2)
--dropout_hidden FLOAT | dropout for hidden layer (0: no dropout) (default: 0.2)
--dropout_source FLOAT | dropout source words (0: no dropout) (default: 0)
--dropout_target FLOAT | dropout target words (0: no dropout) (default: 0)
--layer_normalisation | use layer normalisation (default: False)
--tie_decoder_embeddings | tie the input embeddings of the decoder with the softmax output embeddings
--enc_depth INT | number of encoder layers (default: 1)
--enc_recurrence_transition_depth | number of GRU transition operations applied in an encoder layer (default: 1)
--dec_depth INT | number of decoder layers (default: 1)
--dec_base_recurrence_transition_depth | number of GRU transition operations applied in the first decoder layer (default: 2)
--dec_high_recurrence_transition_depth | number of GRU transition operations applied in decoder layers after the first (default: 1)
--dec_deep_context | pass context vector (from first layer) to deep decoder layers
--output_hidden_activation | activation function in hidden layer of the output network (default: tanh)

training parameters

parameter | description
--maxlen INT | maximum sequence length (default: 100)
--batch_size INT | minibatch size (default: 80)
--token_batch_size INT | minibatch size (expressed in number of source or target tokens). Sentence-level minibatch size will be dynamic. If this is enabled, batch_size only affects sorting by length.
--max_epochs INT | maximum number of epochs (default: 5000)
--finish_after INT | maximum number of updates (minibatches) (default: 10000000)
--decay_c FLOAT | L2 regularization penalty (default: 0)
--map_decay_c FLOAT | MAP-L2 regularization penalty towards original weights (default: 0)
--prior_model STR | prior model for MAP-L2 regularization. Unless using "--reload", this will also be used for initialization.
--clip_c FLOAT | gradient clipping threshold (default: 1)
--learning_rate FLOAT | learning rate (default: 0.0001)
--label_smoothing FLOAT | label smoothing (default: 0)
--no_shuffle | disable shuffling of training data (for each epoch)
--no_sort_by_length | do not sort sentences in maxibatch by length
--maxibatch_size INT | size of maxibatch (number of minibatches that are sorted by length) (default: 20)
--optimizer | optimizer (default: adam)
--keep_train_set_in_memory | keep training dataset lines stored in RAM during training

validation parameters

parameter | description
--valid_source_dataset PATH | parallel validation corpus (source side)
--valid_target_dataset PATH | parallel validation corpus (target side)
--valid_batch_size INT | validation minibatch size (default: 80)
--valid_token_batch_size INT | validation minibatch size (expressed in number of source or target tokens). Sentence-level minibatch size will be dynamic. If this is enabled, valid_batch_size only affects sorting by length.
--validFreq INT | validation frequency (default: 10000)
--patience INT | early stopping patience (default: 10)
--run_validation | compute validation score on validation dataset

display parameters

parameter | description
--dispFreq INT | display loss after INT updates (default: 1000)
--sampleFreq INT | display some samples after INT updates (default: 10000)
--beamFreq INT | display some beam_search samples after INT updates (default: 10000)
--beam_size INT | size of the beam (default: 12)
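To make the parameter tables concrete, here is a minimal, illustrative training command; all file paths and JSON vocabulary names are placeholders rather than files shipped with Nematus, and every option not listed keeps its default from the tables above:

python nematus/nmt.py \
    --source_dataset data/corpus.bpe.en \
    --target_dataset data/corpus.bpe.de \
    --dictionaries data/corpus.bpe.en.json data/corpus.bpe.de.json \
    --model model.npz \
    --embedding_size 512 \
    --state_size 1000 \
    --batch_size 80 \
    --maxlen 100 \
    --use_dropout \
    --valid_source_dataset data/dev.bpe.en \
    --valid_target_dataset data/dev.bpe.de \
    --validFreq 10000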

nematus/translate.py : use an existing model to translate a source text

parameter | description
-k K | beam size (default: 5)
-p P | number of processes (default: 5)
-n | normalize scores by sentence length
-v | verbose mode
--models MODELS [MODELS ...], -m MODELS [MODELS ...] | model to use; provide multiple models (with same vocabulary) for ensemble decoding
--input PATH, -i PATH | input file (default: standard input)
--output PATH, -o PATH | output file (default: standard output)
--n-best | write n-best list (of size k)
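For example, a hypothetical decoding run with a beam of 12 and length-normalized scores (the model and file names are placeholders):

python nematus/translate.py \
    -m model.npz \
    -i newstest.bpe.en \
    -o newstest.output.de \
    -k 12 -n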

nematus/score.py : use an existing model to score a parallel corpus

parameter | description
-b B | minibatch size (default: 80)
-n | normalize scores by sentence length
-v | verbose mode
--models MODELS [MODELS ...], -m MODELS [MODELS ...] | model to use; provide multiple models (with same vocabulary) for ensemble decoding
--source PATH, -s PATH | source text file
--target PATH, -t PATH | target text file
--output PATH, -o PATH | output file (default: standard output)
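An illustrative scoring call (file names are placeholders); it writes one model score per sentence pair of the parallel files:

python nematus/score.py \
    -m model.npz \
    -s newstest.bpe.en \
    -t newstest.bpe.de \
    -o newstest.scores \
    -b 80 -n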

nematus/rescore.py : use an existing model to rescore an n-best list.

The n-best list is assumed to have the same format as Moses:

sentence-ID (starting from 0) ||| translation ||| scores

New scores will be appended to the end. rescore.py has the same arguments as score.py, with the exception of this additional parameter:

parameter | description
--input PATH, -i PATH | input n-best list file (default: standard input)
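An illustrative n-best entry before and after rescoring; the sentence, the feature name F0 and all score values are made up, the only point being that rescore.py appends its score at the end of the line:

0 ||| das ist ein Beispiel . ||| F0= -4.27
0 ||| das ist ein Beispiel . ||| F0= -4.27 -3.91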

nematus/theano_tf_convert.py : convert an existing theano model to a tensorflow model

If you have a Theano model (model.npz) with network architecture features that are currently supported then you can convert it into a tensorflow model using nematus/theano_tf_convert.py.
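A hedged sketch of a conversion call; the flag names used below (--from_theano, --in, --out) are assumptions from memory rather than something stated in this README, so check them with --help first:

python nematus/theano_tf_convert.py --help
# flag names below are assumptions; verify against the --help output above
python nematus/theano_tf_convert.py --from_theano --in theano_model.npz --out tf_model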

PUBLICATIONS

If you use Nematus, please cite the following paper:

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry and Maria Nadejde (2017): Nematus: a Toolkit for Neural Machine Translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, pp. 65-68.

@InProceedings{sennrich-EtAl:2017:EACLDemo,
  author    = {Sennrich, Rico  and  Firat, Orhan  and  Cho, Kyunghyun  and  Birch, Alexandra  and  Haddow, Barry  and  Hitschler, Julian  and  Junczys-Dowmunt, Marcin  and  L\"{a}ubli, Samuel  and  Miceli Barone, Antonio Valerio  and  Mokry, Jozef  and  Nadejde, Maria},
  title     = {Nematus: a Toolkit for Neural Machine Translation},
  booktitle = {Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics},
  month     = {April},
  year      = {2017},
  address   = {Valencia, Spain},
  publisher = {Association for Computational Linguistics},
  pages     = {65--68},
  url       = {http://aclweb.org/anthology/E17-3017}
}

The code is based on the following model:

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio (2015): Neural Machine Translation by Jointly Learning to Align and Translate, Proceedings of the International Conference on Learning Representations (ICLR).

Please refer to the Nematus paper for a description of implementation differences.

ACKNOWLEDGMENTS

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreements 645452 (QT21), 644333 (TraMOOC), 644402 (HimL) and 688139 (SUMMA).


 

2. The WMT17 training scripts (these scripts were used for the WMT17 shared task systems; see section 3)

Index of /wmt17_systems/training

  • data/
  • downloads/
  • model/
  • scripts/
  • scripts.tensorflow/
  • vars

WMT17 TRAINING SCRIPTS

We used various different approaches for preprocessing and data augmentation for monolingual data for different languages. Check the system description for more detail.

In this directory, we provide a sample configuration for preprocessing and training for English->German.


Please note that this script will not reproduce our WMT17 results, which also rely on the use of back-translated monolingual data, and combination of multiple models.

Please also have a look at last year's accompanying scripts and sample configurations; among others, there is documentation for right-to-left reranking: https://github.com/rsennrich/wmt16-scripts


Note: since the WMT17 models were developed, Nematus has switched from using a Theano back-end to using TensorFlow. The scripts provided in the scripts directory are for use with the Theano version; updated scripts for use with the current TensorFlow version can be found in scripts.tensorflow.

 

USAGE INSTRUCTIONS

  1. download sample files (WMT17 parallel training data, dev and test sets):

    scripts/download_files.sh
    
  2. preprocess the training, development and test corpora:

     scripts/preprocess.sh
    
  3. train a Nematus model:

     scripts/train.sh
    
  4. evaluate your model:

     scripts/evaluate.sh

 

 

3. The University of Edinburgh's WMT17 systems

THE UNIVERSITY OF EDINBURGH’S WMT17 SYSTEMS


This directory contains some of the University of Edinburgh's submissions to the WMT17 shared translation task, and a 'training' directory with scripts to preprocess and train your own model.

If you are accessing this through a git repository, it will contain all scripts and documentation, but no model files - the models are accessible at http://data.statmt.org/wmt17_systems

Use the git repository to keep track of changes to this directory: https://github.com/EdinburghNLP/wmt17-scripts

REQUIREMENTS

The models use the following software:

Please set the appropriate paths in the 'vars' file.

DOWNLOAD INSTRUCTIONS

You can download all files in this directory with this command:

wget -r -e robots=off -nH -np -R index.html* http://data.statmt.org/wmt17_systems/

To download just one language pair (such as en-de), execute:

wget -r -e robots=off -nH -np -R index.html* http://data.statmt.org/wmt17_systems/en-de/

To download just a single model (approx 2GB) and the corresponding translation scripts, ignoring ensembles, execute:

wget -r -e robots=off -nH -np -R *ens2* -R *ens3* -R *ens4* -R *r2l* -R translate-ensemble.sh -R translate-reranked.sh -R index.html* http://data.statmt.org/wmt17_systems/en-de/

If you only download selected language pairs or models, you should also download these files, which are shared:

wget -r -e robots=off -nH -np -R index.html* http://data.statmt.org/wmt17_systems/scripts/ http://data.statmt.org/wmt17_systems/vars

USAGE INSTRUCTIONS: PRE-TRAINED MODELS

First, ensure that all requirements are present, and that the path names in the 'vars' file are up-to-date. If you want to decode on a GPU, you can also update the 'device' variable in that file.
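A purely hypothetical sketch of what such a 'vars' file could contain; only the 'device' variable is mentioned in this README, and the other variable names and all paths are illustrative assumptions, so defer to the file you actually downloaded:

# hypothetical example; only 'device' is confirmed by this README
nematus_home=/path/to/nematus
moses_scripts=/path/to/mosesdecoder/scripts
device=gpu0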

Each subdirectory comes with several translate-*.sh scripts.

For translation with a single model, execute:

./translate-single.sh < your_input_file > your_output_file

The input should be UTF-8 plain text in the source language, one sentence per line.

We also provide ensembles of left-to-right models:

./translate-ensemble.sh < your_input_file > your_output_file

For some language pairs, we built models that use right-to-left models for reranking:

./translate-reranked.sh < your_input_file > your_output_file

We used systems that include ensembles and right-to-left reranking for our official submissions;

results may vary slightly from the official submissions due to post-submission improvements - see the shared task description for more details.

USAGE INSTRUCTIONS: TRAINING SCRIPTS

For training your own models, follow the instructions in training/README.md (see section 2 above).

LICENSE

All scripts in this directory are distributed under MIT license.

The use of the models provided in this directory is permitted under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/

Attribution - You must give appropriate credit [please use the citation below], provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

NonCommercial - You may not use the material for commercial purposes.

ShareAlike - If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

REFERENCE

The models are described in the following publication:

Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, and Philip Williams (2017). “The University of Edinburgh’s Neural MT Systems for WMT17”. In: Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers. Copenhagen, Denmark.

@inproceedings{uedin-nmt:2017,
    address = "Copenhagen, Denmark",
    author = "Sennrich, Rico and Birch, Alexandra and Currey, Anna and 
              Germann, Ulrich and Haddow, Barry and Heafield, Kenneth and 
              {Miceli Barone}, Antonio Valerio and Williams, Philip",
    booktitle = "{Proceedings of the Second Conference on Machine Translation, 
                 Volume 2: Shared Task Papers}",
    title = "{The University of Edinburgh's Neural MT Systems for WMT17}",
    year = "2017"
}