Nematus (Part 1): Data Preprocessing and Hyperparameter Configuration

Nematus is a neural machine translation toolkit.

Code: https://github.com/czhiming/Nematus

1. Data preprocessing: ./preprocess.sh

The main steps are:
  • tokenization
    Spaces have to be inserted between (e.g.) words and punctuation.
#!/bin/sh

# suffix of source language files
SRC=en
SRCTAG=en
# suffix of target language files
TRG=es
TRGTAG=es
# number of merge operations. Network vocabulary should be slightly larger (to include characters),
# or smaller if the operations are learned on the joint vocabulary
#bpe_operations=89500
bpe_operations=45000

tools=tools
# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=$tools/moses
# path to subword segmentation scripts: https://github.com/rsennrich/subword-nmt
subword_nmt=$tools/subword-nmt 

# tokenize
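# note: normalise-romanian.py and remove-diacritics.py originate from the
# WMT16 Romanian sample scripts; on an English source they should have
# little effect and could likely be dropped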
for prefix in train dev
 do
   cat data/$prefix.$SRCTAG | \
   $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $SRC | \
   $tools/normalise-romanian.py | \
   $tools/remove-diacritics.py | \
   $tools/tokenizer.perl -a -l $SRC > data/$prefix.tok.$SRCTAG

   cat data/$prefix.$TRGTAG | \
   $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $TRG | \
   $tools/tokenizer.perl -a -l $TRG > data/$prefix.tok.$TRGTAG

 done
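To make the effect concrete, here is a minimal Python sketch of what tokenization does (a toy stand-in only; the real Moses tokenizer.perl handles abbreviations, URLs, and many other edge cases):

import re

def naive_tokenize(line):
    # put spaces around every punctuation character
    line = re.sub(r"([^\w\s])", r" \1 ", line)
    # collapse the runs of whitespace left behind
    return re.sub(r"\s+", " ", line).strip()

print(naive_tokenize("Hello, world!"))  # -> "Hello , world !"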
  • clean (remove empty sentences and sentences longer than 50 tokens)
    Long sentences and empty sentences are removed as they can cause problems with the training pipeline, and obviously mis-aligned sentences are removed.
#############################################
# clean empty and long sentences, and sentences with high source-target ratio (training corpus only)
$mosesdecoder/scripts/training/clean-corpus-n.perl data/train.tok $SRCTAG $TRGTAG data/train.tok.clean 1 50 #80
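In Python terms, the filtering that clean-corpus-n.perl performs with these arguments looks roughly like the sketch below (min length 1, max length 50; the source/target length-ratio limit of 9 is an assumption based on the Moses default):

def clean_corpus(src_in, trg_in, src_out, trg_out,
                 min_len=1, max_len=50, max_ratio=9.0):
    # drop empty/over-long pairs and pairs whose length ratio
    # suggests a mis-alignment
    with open(src_in) as fs, open(trg_in) as ft, \
         open(src_out, "w") as out_s, open(trg_out, "w") as out_t:
        for s, t in zip(fs, ft):
            ls, lt = len(s.split()), len(t.split())
            if not (min_len <= ls <= max_len and min_len <= lt <= max_len):
                continue
            if float(ls) / max(lt, 1) > max_ratio or float(lt) / max(ls, 1) > max_ratio:
                continue
            out_s.write(s)
            out_t.write(t)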
  • truecase (lowercase sentence-initial words, except for names)
    The initial words in each sentence are converted to their most probable casing. This helps reduce data sparsity.
# train truecaser
$mosesdecoder/scripts/recaser/train-truecaser.perl -corpus data/train.tok.clean.$SRCTAG -model model/truecase-model.$SRCTAG
$mosesdecoder/scripts/recaser/train-truecaser.perl -corpus data/train.tok.clean.$TRGTAG -model model/truecase-model.$TRGTAG

# apply truecaser (cleaned training corpus)
for prefix in train
 do
  $tools/scripts/truecase.perl -model model/truecase-model.$SRCTAG < data/$prefix.tok.clean.$SRCTAG > data/$prefix.tc.$SRCTAG
  $tools/scripts/truecase.perl -model model/truecase-model.$TRGTAG < data/$prefix.tok.clean.$TRGTAG > data/$prefix.tc.$TRGTAG
 done

# apply truecaser (dev/test files)
for prefix in dev
 do
  $tools/scripts/truecase.perl -model model/truecase-model.$SRCTAG < data/$prefix.tok.$SRCTAG > data/$prefix.tc.$SRCTAG
  $tools/scripts/truecase.perl -model model/truecase-model.$TRGTAG < data/$prefix.tok.$TRGTAG > data/$prefix.tc.$TRGTAG
 done
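The truecasing idea in a hedged Python sketch (train-truecaser.perl additionally uses sentence-internal context and known-casing heuristics; this shows only the core): learn each word's most frequent surface form from non-sentence-initial positions, then rewrite the first token of every sentence accordingly.

from collections import Counter

def train_truecaser(corpus_path):
    # count surface forms of each word in non-sentence-initial positions
    counts = {}
    with open(corpus_path) as f:
        for line in f:
            for tok in line.split()[1:]:
                counts.setdefault(tok.lower(), Counter())[tok] += 1
    # pick the most frequent casing for each word
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def truecase(line, model):
    toks = line.split()
    if toks:
        toks[0] = model.get(toks[0].lower(), toks[0])
    return " ".join(toks)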
  • bpe (segment words into subword units with the BPE algorithm)
    • before: Lea Michele ’ s On Her Way To The Top Of The Charts With This Song !
    • after: Le@@ a Michele ’ s On Her Way To The Top Of The Char@@ ts With This Song !
      Reference: Neural Machine Translation of Rare Words with Subword Units,
      http://www.aclweb.org/anthology/P/P16/P16-1162.pdf
# train BPE (separate BPE models are learned for the source and target languages)
cat data/train.tc.$SRCTAG | $subword_nmt/learn_bpe.py -s $bpe_operations > model/$SRCTAG.bpe
cat data/train.tc.$TRGTAG | $subword_nmt/learn_bpe.py -s $bpe_operations > model/$TRGTAG.bpe

# apply BPE

for prefix in train dev
 do
  $tools/scripts/apply_bpe.py -c model/$SRCTAG.bpe < data/$prefix.tc.$SRCTAG > data/$prefix.bpe.$SRCTAG
  $tools/scripts/apply_bpe.py -c model/$TRGTAG.bpe < data/$prefix.tc.$TRGTAG > data/$prefix.bpe.$TRGTAG
 done
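learn_bpe.py implements the algorithm from the paper above; the sketch below condenses its core loop (adapted from the paper's pseudocode, with a toy vocabulary): repeatedly find the most frequent adjacent symbol pair and merge it, for bpe_operations iterations.

import re
from collections import Counter

def pair_stats(vocab):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # replace each occurrence of the pair with its concatenation
    pat = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pat.sub("".join(pair), w): f for w, f in vocab.items()}

# toy vocabulary: words as space-separated symbols, </w> marks word end
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "c h a r t s </w>": 3}
for _ in range(10):  # bpe_operations merges in the real script
    pairs = pair_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    print(best)
    vocab = merge_pair(best, vocab)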
  • build vocabulary
# build network dictionary
python $tools/build_dictionary.py data/train.bpe.$SRCTAG data/train.bpe.$TRGTAG
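build_dictionary.py writes one <corpus>.json file per input, mapping tokens to integer ids in descending frequency order. A rough re-creation (reserving ids 0 and 1 for 'eos' and 'UNK' follows Nematus's convention):

import json
from collections import Counter

def build_dictionary(path):
    freqs = Counter()
    with open(path) as f:
        for line in f:
            freqs.update(line.split())
    # ids 0 and 1 are reserved for end-of-sentence and unknown words
    word_dict = {"eos": 0, "UNK": 1}
    for i, (word, _) in enumerate(freqs.most_common()):
        word_dict[word] = i + 2
    with open(path + ".json", "w") as out:
        json.dump(word_dict, out, indent=2)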

2. Configuring the network hyperparameters: ./config.py

The main parameters are:
  • reload: whether to reload a saved model; checkpoints are written periodically so training can resume after an unexpected interruption (e.g. a power outage)
  • dim_word: word embedding dimension
  • dim: hidden state dimension (hidden unit size)
  • n_words_src: source-language vocabulary size
  • n_words_tgt: target-language vocabulary size
  • decay_c: regularization coefficient λ
  • patience: used for early stopping (see the sketch after this list)
  • lrate: initial learning rate
  • optimizer: optimizer
  • maxlen: maximum sentence length in the training corpus
  • batch_size: batch size; on a GPU with little memory, keep this small to avoid out-of-memory errors
  • datasets: training set
  • valid_datasets: validation set
  • dictionaries: vocabulary files
  • validFreq: how often (in updates) to validate
  • dispFreq: how often to print progress information
  • saveFreq: how often to save the model
  • sampleFreq: how often to print samples from the validation data (i.e. some example source sentences, target-language outputs, and reference translations)
  • use_dropout: whether to use dropout
  • overwrite: whether intermediate model checkpoints are kept: False keeps them, True does not
  • external_validation_script: the script invoked at validation time
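To make patience concrete, here is a hedged sketch of the early-stopping logic (not Nematus's actual implementation): the validation cost is computed every validFreq updates, and training stops once it has failed to improve on the best cost for patience consecutive validations.

def should_stop(valid_costs, patience=10):
    # stop when the last `patience` validation costs are all worse
    # than the best cost observed so far
    if len(valid_costs) <= patience:
        return False
    best = min(valid_costs)
    return all(cost > best for cost in valid_costs[-patience:])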
import numpy
import os
import sys

# adjust these as needed, depending on the vocabulary sizes
VOCAB_SIZE_SRC = 40000
VOCAB_SIZE_TGT = 40000
SRCTAG = "en"
TRGTAG = "es"
DATA_DIR = "data"
TUNING_DIR = "tuning"
model = "model_en_es"

from nematus.nmt import train


if __name__ == '__main__':
    validerr = train(saveto=model+'/model.npz',
                    reload_=True,
                    dim_word=500,
                    dim=1024,
                    n_words_tgt=VOCAB_SIZE_TGT,
                    n_words_src=VOCAB_SIZE_SRC,
                    decay_c=0.,
                    clip_c=1.,
                    patience=10, #early stop patience
                    lrate=0.0001,
                    optimizer='adam', #adam,adadelta
                    maxlen=50,
                    batch_size=80,
                    valid_batch_size=80,
                    datasets=[DATA_DIR + '/train.bpe.' + SRCTAG, DATA_DIR + '/train.bpe.' + TRGTAG],
                    valid_datasets=[DATA_DIR + '/dev.bpe.' + SRCTAG, DATA_DIR + '/dev.bpe.' + TRGTAG],
                    dictionaries=[DATA_DIR + '/train.bpe.' + SRCTAG + '.json',DATA_DIR + '/train.bpe.' + TRGTAG + '.json'],
                    validFreq=10000, #10000,3000
                    dispFreq=1000,  #1000,100
                    saveFreq=30000, #30000,10000
                    #sampleFreq=10000,
                    sampleFreq=0,  # 0: produce no samples
                    use_dropout=True,
                    dropout_embedding=0.2, # dropout for input embeddings (0: no dropout)
                    dropout_hidden=0.2, # dropout for hidden layers (0: no dropout)
                    dropout_source=0.1, # dropout source words (0: no dropout)
                    dropout_target=0.1, # dropout target words (0: no dropout)
                    overwrite=False,
                    external_validation_script='./validate.sh')
    print(validerr)