Getting started with NMT: Nematus (TensorFlow-based version)

 

1. First, the README on GitHub:

 

NEMATUS

Attention-based encoder-decoder model for neural machine translation

This package is based on the dl4mt-tutorial by Kyunghyun Cho et al. ( https://github.com/nyu-dl/dl4mt-tutorial ). It was used to produce top-scoring systems at the WMT 16 shared translation task.

The changes to Nematus include:

  • the model has been re-implemented in tensorflow. See https://github.com/EdinburghNLP/nematus/tree/theano for the Theano-based version of Nematus.

  • new architecture variants for better performance

  • improvements to scoring and decoding:

    • n-best output for decoder // i.e., the n highest-scoring translations can be output
    • scripts for scoring (given parallel corpus) and rescoring (of n-best output)
  • usability improvements:

    • command line interface for training
    • vocabulary files and model parameters are stored in JSON format (backward-compatible loading)
    • server mode

See the changelog for more info.

SUPPORT

For general support requests, there is a Google Groups mailing list at https://groups.google.com/d/forum/nematus-support . You can also send an e-mail to nematus-support@googlegroups.com .

INSTALLATION

Nematus requires the following packages:

  • Python >= 2.7
  • tensorflow

To install tensorflow, we recommend following the steps at: ( https://www.tensorflow.org/install/ )

The following packages are optional, but highly recommended:

  • CUDA >= 7 (only GPU training is sufficiently fast)
  • cuDNN >= 4 (speeds up training substantially)

DOCKER USAGE // Docker packages an application together with its environment so it can be run as an isolated container

Introductions to Docker (in Chinese): https://baike.baidu.com/item/Docker/13344470?fr=aladdin

http://www.runoob.com/docker/docker-tutorial.html

You can also create a Docker image by running the following command, where you replace suffix with either cpu or gpu:

docker build -t nematus-docker -f Dockerfile.suffix .

To run a CPU docker instance with the current working directory shared with the Docker container, execute:

docker run -v `pwd`:/playground -it nematus-docker

For GPU you need to have nvidia-docker installed and run:

nvidia-docker run -v `pwd`:/playground -it nematus-docker

TRAINING SPEED

Training speed depends heavily on having appropriate hardware (ideally a recent NVIDIA GPU), and having installed the appropriate software packages.

To test your setup, we provide some speed benchmarks with 'test/test_train.sh', on an Intel Xeon CPU E5-2620 v4, with a Nvidia GeForce GTX Titan X (Pascal) and CUDA 9.0:

GPU, CuDNN 5.1, tensorflow 1.0.1:

CUDA_VISIBLE_DEVICES=0 ./test_train.sh

225.25 sentences/s

USAGE INSTRUCTIONS

All of the scripts below can be run with the --help flag to get usage information.

Sample commands with toy examples are available in the test directory; for training a full-scale system, consider the training scripts at http://data.statmt.org/wmt17_systems/training/

// For training a full-scale system, see the WMT17 training scripts (covered in section 2 below).
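Each of the scripts described below prints its full option list with --help; for example (run from the Nematus repository root; whether you invoke python or python2 depends on your installation):

python nematus/nmt.py --help
python nematus/translate.py --help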

Next, the individual scripts and their parameters:

nematus/nmt.py : use to train a new model

data sets; model loading and saving

parameter | description
--source_dataset PATH | parallel training corpus (source side)
--target_dataset PATH | parallel training corpus (target side)
--dictionaries PATH [PATH ...] | network vocabularies (one per source factor, plus target vocabulary)
--model PATH | model file name (default: model.npz)
--saveFreq INT | save frequency (default: 30000)
--reload | load existing model (if '--model' points to existing model)
--no_reload_training_progress | don't reload training progress (only used if --reload is enabled)
--summary_dir | directory for saving summaries (default: same directory as the --saveto file)
--summaryFreq | save summaries after INT updates, if 0 do not save summaries (default: 0)

network parameters

parameter | description
--embedding_size INT | embedding layer size (default: 512)
--state_size INT | hidden layer size (default: 1000)
--source_vocab_sizes INT | source vocabulary sizes (one per input factor) (default: None)
--target_vocab_size INT | target vocabulary size (default: None)
--factors INT | number of input factors (default: 1)
--dim_per_factor INT [INT ...] | list of word vector dimensionalities (one per factor): '--dim_per_factor 250 200 50' for total dimensionality of 500 (default: None)
--use_dropout | use dropout layer (default: False)
--dropout_embedding FLOAT | dropout for input embeddings (0: no dropout) (default: 0.2)
--dropout_hidden FLOAT | dropout for hidden layer (0: no dropout) (default: 0.2)
--dropout_source FLOAT | dropout source words (0: no dropout) (default: 0)
--dropout_target FLOAT | dropout target words (0: no dropout) (default: 0)
--layer_normalisation | use layer normalisation (default: False)
--tie_decoder_embeddings | tie the input embeddings of the decoder with the softmax output embeddings
--enc_depth INT | number of encoder layers (default: 1)
--enc_recurrence_transition_depth | number of GRU transition operations applied in an encoder layer (default: 1)
--dec_depth INT | number of decoder layers (default: 1)
--dec_base_recurrence_transition_depth | number of GRU transition operations applied in the first decoder layer (default: 2)
--dec_high_recurrence_transition_depth | number of GRU transition operations applied in decoder layers after the first (default: 1)
--dec_deep_context | pass context vector (from first layer) to deep decoder layers
--output_hidden_activation | activation function in hidden layer of the output network (default: tanh)

training parameters

parameter | description
--maxlen INT | maximum sequence length (default: 100)
--batch_size INT | minibatch size (default: 80)
--token_batch_size INT | minibatch size (expressed in number of source or target tokens). Sentence-level minibatch size will be dynamic. If this is enabled, batch_size only affects sorting by length.
--max_epochs INT | maximum number of epochs (default: 5000)
--finish_after INT | maximum number of updates (minibatches) (default: 10000000)
--decay_c FLOAT | L2 regularization penalty (default: 0)
--map_decay_c FLOAT | MAP-L2 regularization penalty towards original weights (default: 0)
--prior_model STR | prior model for MAP-L2 regularization. Unless using "--reload", this will also be used for initialization.
--clip_c FLOAT | gradient clipping threshold (default: 1)
--learning_rate FLOAT | learning rate (default: 0.0001)
--label_smoothing FLOAT | label smoothing (default: 0)
--no_shuffle | disable shuffling of training data (for each epoch)
--no_sort_by_length | do not sort sentences in maxibatch by length
--maxibatch_size INT | size of maxibatch (number of minibatches that are sorted by length) (default: 20)
--optimizer | optimizer (default: adam)
--keep_train_set_in_memory | keep training dataset lines stored in RAM during training

validation parameters

parameter | description
--valid_source_dataset PATH | parallel validation corpus (source side)
--valid_target_dataset PATH | parallel validation corpus (target side)
--valid_batch_size INT | validation minibatch size (default: 80)
--valid_token_batch_size INT | validation minibatch size (expressed in number of source or target tokens). Sentence-level minibatch size will be dynamic. If this is enabled, valid_batch_size only affects sorting by length.
--validFreq INT | validation frequency (default: 10000)
--patience INT | early stopping patience (default: 10)
--run_validation | compute validation score on validation dataset

display parameters

parameter | description
--dispFreq INT | display loss after INT updates (default: 1000)
--sampleFreq INT | display some samples after INT updates (default: 10000)
--beamFreq INT | display some beam_search samples after INT updates (default: 10000)
--beam_size INT | size of the beam (default: 12)
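To make the parameter tables concrete, here is a minimal, illustrative training command; all file paths and JSON vocabulary names are placeholders rather than files shipped with Nematus, and every option not listed keeps its default from the tables above:

python nematus/nmt.py \
    --source_dataset data/corpus.bpe.en \
    --target_dataset data/corpus.bpe.de \
    --dictionaries data/corpus.bpe.en.json data/corpus.bpe.de.json \
    --model model.npz \
    --embedding_size 512 \
    --state_size 1000 \
    --batch_size 80 \
    --maxlen 100 \
    --use_dropout \
    --valid_source_dataset data/dev.bpe.en \
    --valid_target_dataset data/dev.bpe.de \
    --validFreq 10000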

nematus/translate.py : use an existing model to translate a source text

parameter | description
-k K | beam size (default: 5)
-p P | number of processes (default: 5)
-n | normalize scores by sentence length
-v | verbose mode
--models MODELS [MODELS ...], -m MODELS [MODELS ...] | model to use; provide multiple models (with same vocabulary) for ensemble decoding
--input PATH, -i PATH | input file (default: standard input)
--output PATH, -o PATH | output file (default: standard output)
--n-best | write n-best list (of size k)
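For example, a hypothetical decoding run with a beam of 12 and length-normalized scores (the model and file names are placeholders):

python nematus/translate.py \
    -m model.npz \
    -i newstest.bpe.en \
    -o newstest.output.de \
    -k 12 -n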

nematus/score.py : use an existing model to score a parallel corpus

parameter | description
-b B | minibatch size (default: 80)
-n | normalize scores by sentence length
-v | verbose mode
--models MODELS [MODELS ...], -m MODELS [MODELS ...] | model to use; provide multiple models (with same vocabulary) for ensemble decoding
--source PATH, -s PATH | source text file
--target PATH, -t PATH | target text file
--output PATH, -o PATH | output file (default: standard output)
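An illustrative scoring call (file names are placeholders); it writes one model score per sentence pair of the parallel files:

python nematus/score.py \
    -m model.npz \
    -s newstest.bpe.en \
    -t newstest.bpe.de \
    -o newstest.scores \
    -b 80 -n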

nematus/rescore.py : use an existing model to rescore an n-best list.

The n-best list is assumed to have the same format as Moses:

sentence-ID (starting from 0) ||| translation ||| scores

New scores will be appended to the end. rescore.py has the same arguments as score.py, with the exception of this additional parameter:

parameter | description
--input PATH, -i PATH | input n-best list file (default: standard input)
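An illustrative n-best entry before and after rescoring; the sentence, the feature name F0 and all score values are made up, the only point being that rescore.py appends its score at the end of the line:

0 ||| das ist ein Beispiel . ||| F0= -4.27
0 ||| das ist ein Beispiel . ||| F0= -4.27 -3.91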

nematus/theano_tf_convert.py : convert an existing theano model to a tensorflow model

If you have a Theano model (model.npz) with network architecture features that are currently supported then you can convert it into a tensorflow model using nematus/theano_tf_convert.py.
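A hedged sketch of a conversion call; the flag names used below (--from_theano, --in, --out) are assumptions from memory rather than something stated in this README, so check them with --help first:

python nematus/theano_tf_convert.py --help
# flag names below are assumptions; verify against the --help output above
python nematus/theano_tf_convert.py --from_theano --in theano_model.npz --out tf_model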

PUBLICATIONS

If you use Nematus, please cite the following paper:

Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry and Maria Nadejde (2017): Nematus: a Toolkit for Neural Machine Translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, pp. 65-68.

@InProceedings{sennrich-EtAl:2017:EACLDemo,
  author    = {Sennrich, Rico  and  Firat, Orhan  and  Cho, Kyunghyun  and  Birch, Alexandra  and  Haddow, Barry  and  Hitschler, Julian  and  Junczys-Dowmunt, Marcin  and  L\"{a}ubli, Samuel  and  Miceli Barone, Antonio Valerio  and  Mokry, Jozef  and  Nadejde, Maria},
  title     = {Nematus: a Toolkit for Neural Machine Translation},
  booktitle = {Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics},
  month     = {April},
  year      = {2017},
  address   = {Valencia, Spain},
  publisher = {Association for Computational Linguistics},
  pages     = {65--68},
  url       = {http://aclweb.org/anthology/E17-3017}
}

The code is based on the following model:

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio (2015): Neural Machine Translation by Jointly Learning to Align and Translate, Proceedings of the International Conference on Learning Representations (ICLR).

Please refer to the Nematus paper for a description of implementation differences.

ACKNOWLEDGMENTS

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreements 645452 (QT21), 644333 (TraMOOC), 644402 (HimL) and 688139 (SUMMA).


 

2. The WMT17 training scripts (these scripts were used for the WMT17 shared task systems; see section 3)

Index of /wmt17_systems/training

  • data/
  • downloads/
  • model/
  • scripts/
  • scripts.tensorflow/
  • vars

WMT17 TRAINING SCRIPTS

We used various different approaches for preprocessing and data augmentation for monolingual data for different languages. Check the system description for more detail.

In this directory, we provide a sample configuration for preprocessing and training for English->German.


Please note that this script will not reproduce our WMT17 results, which also rely on the use of back-translated monolingual data, and combination of multiple models.

Please also have a look at last year's accompanying scripts and sample configurations; among others, there is documentation for right-to-left reranking: https://github.com/rsennrich/wmt16-scripts


Note: since the WMT17 models were developed, Nematus has switched from using a Theano back-end to using TensorFlow. The scripts provided in the scripts directory are for use with the Theano version; updated scripts for use with the current TensorFlow version can be found in scripts.tensorflow.

 

USAGE INSTRUCTIONS

  1. download sample files (WMT17 parallel training data, dev and test sets):

    scripts/download_files.sh
    
  2. preprocess the training, development and test corpora:

     scripts/preprocess.sh
    
  3. train a Nematus model:

     scripts/train.sh
    
  4. evaluate your model:

     scripts/evaluate.sh

 

 

3. The University of Edinburgh's WMT17 systems

THE UNIVERSITY OF EDINBURGH’S WMT17 SYSTEMS


This directory contains some of the University of Edinburgh's submissions to the WMT17 shared translation task, and a 'training' directory with scripts to preprocess and train your own model.

If you are accessing this through a git repository, it will contain all scripts and documentation, but no model files - the models are accessible at http://data.statmt.org/wmt17_systems

Use the git repository to keep track of changes to this directory: https://github.com/EdinburghNLP/wmt17-scripts

REQUIREMENTS

The models use the following software:

Please set the appropriate paths in the 'vars' file.

DOWNLOAD INSTRUCTIONS

You can download all files in this directory with this command:

wget -r -e robots=off -nH -np -R index.html* http://data.statmt.org/wmt17_systems/

To download just one language pair (such as en-de), execute:

wget -r -e robots=off -nH -np -R index.html* http://data.statmt.org/wmt17_systems/en-de/

To download just a single model (approx 2GB) and the corresponding translation scripts, ignoring ensembles, execute:

wget -r -e robots=off -nH -np -R *ens2* -R *ens3* -R *ens4* -R *r2l* -R translate-ensemble.sh -R translate-reranked.sh -R index.html* http://data.statmt.org/wmt17_systems/en-de/

If you only download selected language pairs or models, you should also download these files, which are shared:

wget -r -e robots=off -nH -np -R index.html* http://data.statmt.org/wmt17_systems/scripts/ http://data.statmt.org/wmt17_systems/vars

USAGE INSTRUCTIONS: PRE-TRAINED MODELS

First, ensure that all requirements are present, and that the path names in the 'vars' file are up-to-date. If you want to decode on a GPU, you can also update the 'device' variable in that file.
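A purely hypothetical sketch of what such a 'vars' file could contain; only the 'device' variable is mentioned in this README, and the other variable names and all paths are illustrative assumptions, so defer to the file you actually downloaded:

# hypothetical example; only 'device' is confirmed by this README
nematus_home=/path/to/nematus
moses_scripts=/path/to/mosesdecoder/scripts
device=gpu0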

Each subdirectory comes with several translate-*.sh scripts.

For translation with a single model, execute:

./translate-single.sh < your_input_file > your_output_file

The input should be UTF-8 plain text in the source language, one sentence per line.

We also provide ensembles of left-to-right models:

./translate-ensemble.sh < your_input_file > your_output_file

For some language pairs, we built models that use right-to-left models for reranking:

./translate-reranked.sh < your_input_file > your_output_file

We used systems that include ensembles and right-to-left reranking for our official submissions;

results may vary slightly from the official submissions due to post-submission improvements - see the shared task description for more details.

USAGE INSTRUCTIONS: TRAINING SCRIPTS

For training your own models, follow the instructions in training/README.md (see section 2 above).

LICENSE

All scripts in this directory are distributed under MIT license.

The use of the models provided in this directory is permitted under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license (CC BY-NC-SA 3.0): https://creativecommons.org/licenses/by-nc-sa/3.0/

Attribution - You must give appropriate credit [please use the citation below], provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

NonCommercial - You may not use the material for commercial purposes.

ShareAlike - If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

REFERENCE

The models are described in the following publication:

Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, and Philip Williams (2017). “The University of Edinburgh’s Neural MT Systems for WMT17”. In: Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers. Copenhagen, Denmark.

@inproceedings{uedin-nmt:2017,
    address = "Copenhagen, Denmark",
    author = "Sennrich, Rico and Birch, Alexandra and Currey, Anna and 
              Germann, Ulrich and Haddow, Barry and Heafield, Kenneth and 
              {Miceli Barone}, Antonio Valerio and Williams, Philip",
    booktitle = "{Proceedings of the Second Conference on Machine Translation, 
                 Volume 2: Shared Task Papers}",
    title = "{The University of Edinburgh's Neural MT Systems for WMT17}",
    year = "2017"
}