Nematus (Part 1): Data Preprocessing and Hyperparameter Configuration

Nematus is a neural machine translation toolkit.

Code: https://github.com/czhiming/Nematus

1. Data preprocessing: ./preprocess.sh

The main steps are:
  • tokenization
    Spaces have to be inserted between (e.g.) words and punctuation.
#!/bin/sh

# suffix of source language files
SRC=en
SRCTAG=en
# suffix of target language files
TRG=es
TRGTAG=es
# number of merge operations. Network vocabulary should be slightly larger (to include characters),
# or smaller if the operations are learned on the joint vocabulary
#bpe_operations=89500
bpe_operations=45000

tools=tools
# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=$tools/moses
# path to subword segmentation scripts: https://github.com/rsennrich/subword-nmt
subword_nmt=$tools/subword-nmt 

# tokenize
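# note: normalise-romanian.py and remove-diacritics.py originate from the
# WMT16 Romanian sample scripts; on an English source they should have
# little effect and could likely be dropped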
for prefix in train dev
 do
   cat data/$prefix.$SRCTAG | \
   $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $SRC | \
   $tools/normalise-romanian.py | \
   $tools/remove-diacritics.py | \
   $tools/tokenizer.perl -a -l $SRC > data/$prefix.tok.$SRCTAG

   cat data/$prefix.$TRGTAG | \
   $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $TRG | \
   $tools/tokenizer.perl -a -l $TRG > data/$prefix.tok.$TRGTAG

 done
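To make the effect concrete, here is a minimal Python sketch of what tokenization does (a toy stand-in only; the real Moses tokenizer.perl handles abbreviations, URLs, and many other edge cases):

import re

def naive_tokenize(line):
    # put spaces around every punctuation character
    line = re.sub(r"([^\w\s])", r" \1 ", line)
    # collapse the runs of whitespace left behind
    return re.sub(r"\s+", " ", line).strip()

print(naive_tokenize("Hello, world!"))  # -> "Hello , world !"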
  • clean (remove empty sentences and sentences longer than 50 tokens)
    Long sentences and empty sentences are removed as they can cause problems with the training pipeline, and obviously mis-aligned sentences are removed.
#############################################
# clean empty and long sentences, and sentences with high source-target ratio (training corpus only)
$mosesdecoder/scripts/training/clean-corpus-n.perl data/train.tok $SRCTAG $TRGTAG data/train.tok.clean 1 50 #80
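In Python terms, the filtering that clean-corpus-n.perl performs with these arguments looks roughly like the sketch below (min length 1, max length 50; the source/target length-ratio limit of 9 is an assumption based on the Moses default):

def clean_corpus(src_in, trg_in, src_out, trg_out,
                 min_len=1, max_len=50, max_ratio=9.0):
    # drop empty/over-long pairs and pairs whose length ratio
    # suggests a mis-alignment
    with open(src_in) as fs, open(trg_in) as ft, \
         open(src_out, "w") as out_s, open(trg_out, "w") as out_t:
        for s, t in zip(fs, ft):
            ls, lt = len(s.split()), len(t.split())
            if not (min_len <= ls <= max_len and min_len <= lt <= max_len):
                continue
            if float(ls) / max(lt, 1) > max_ratio or float(lt) / max(ls, 1) > max_ratio:
                continue
            out_s.write(s)
            out_t.write(t)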
  • truecase (lowercase sentence-initial words, except for names)
    The initial words in each sentence are converted to their most probable casing. This helps reduce data sparsity.
# train truecaser
$mosesdecoder/scripts/recaser/train-truecaser.perl -corpus data/train.tok.clean.$SRCTAG -model model/truecase-model.$SRCTAG
$mosesdecoder/scripts/recaser/train-truecaser.perl -corpus data/train.tok.clean.$TRGTAG -model model/truecase-model.$TRGTAG

# apply truecaser (cleaned training corpus)
for prefix in train
 do
  $tools/scripts/truecase.perl -model model/truecase-model.$SRCTAG < data/$prefix.tok.clean.$SRCTAG > data/$prefix.tc.$SRCTAG
  $tools/scripts/truecase.perl -model model/truecase-model.$TRGTAG < data/$prefix.tok.clean.$TRGTAG > data/$prefix.tc.$TRGTAG
 done

# apply truecaser (dev/test files)
for prefix in dev
 do
  $tools/scripts/truecase.perl -model model/truecase-model.$SRCTAG < data/$prefix.tok.$SRCTAG > data/$prefix.tc.$SRCTAG
  $tools/scripts/truecase.perl -model model/truecase-model.$TRGTAG < data/$prefix.tok.$TRGTAG > data/$prefix.tc.$TRGTAG
 done
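The truecasing idea in a hedged Python sketch (train-truecaser.perl additionally uses sentence-internal context and known-casing heuristics; this shows only the core): learn each word's most frequent surface form from non-sentence-initial positions, then rewrite the first token of every sentence accordingly.

from collections import Counter

def train_truecaser(corpus_path):
    # count surface forms of each word in non-sentence-initial positions
    counts = {}
    with open(corpus_path) as f:
        for line in f:
            for tok in line.split()[1:]:
                counts.setdefault(tok.lower(), Counter())[tok] += 1
    # pick the most frequent casing for each word
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def truecase(line, model):
    toks = line.split()
    if toks:
        toks[0] = model.get(toks[0].lower(), toks[0])
    return " ".join(toks)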
  • bpe (segment words into subword units with the BPE algorithm)
    • before: Lea Michele ’ s On Her Way To The Top Of The Charts With This Song !
    • after: Le@@ a Michele ’ s On Her Way To The Top Of The Char@@ ts With This Song !
      Reference: Neural Machine Translation of Rare Words with Subword Units,
      http://www.aclweb.org/anthology/P/P16/P16-1162.pdf
# train BPE (separate BPE models are learned for the source and target languages)
cat data/train.tc.$SRCTAG | $subword_nmt/learn_bpe.py -s $bpe_operations > model/$SRCTAG.bpe
cat data/train.tc.$TRGTAG | $subword_nmt/learn_bpe.py -s $bpe_operations > model/$TRGTAG.bpe

# apply BPE

for prefix in train dev
 do
  $tools/scripts/apply_bpe.py -c model/$SRCTAG.bpe < data/$prefix.tc.$SRCTAG > data/$prefix.bpe.$SRCTAG
  $tools/scripts/apply_bpe.py -c model/$TRGTAG.bpe < data/$prefix.tc.$TRGTAG > data/$prefix.bpe.$TRGTAG
 done
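learn_bpe.py implements the algorithm from the paper above; the sketch below condenses its core loop (adapted from the paper's pseudocode, with a toy vocabulary): repeatedly find the most frequent adjacent symbol pair and merge it, for bpe_operations iterations.

import re
from collections import Counter

def pair_stats(vocab):
    # count adjacent symbol pairs, weighted by word frequency
    pairs = Counter()
    for word, freq in vocab.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # replace each occurrence of the pair with its concatenation
    pat = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pat.sub("".join(pair), w): f for w, f in vocab.items()}

# toy vocabulary: words as space-separated symbols, </w> marks word end
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "c h a r t s </w>": 3}
for _ in range(10):  # bpe_operations merges in the real script
    pairs = pair_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    print(best)
    vocab = merge_pair(best, vocab)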
  • build vocabulary
# build network dictionary
python $tools/build_dictionary.py data/train.bpe.$SRCTAG data/train.bpe.$TRGTAG
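build_dictionary.py writes one <corpus>.json file per input, mapping tokens to integer ids in descending frequency order. A rough re-creation (reserving ids 0 and 1 for 'eos' and 'UNK' follows Nematus's convention):

import json
from collections import Counter

def build_dictionary(path):
    freqs = Counter()
    with open(path) as f:
        for line in f:
            freqs.update(line.split())
    # ids 0 and 1 are reserved for end-of-sentence and unknown words
    word_dict = {"eos": 0, "UNK": 1}
    for i, (word, _) in enumerate(freqs.most_common()):
        word_dict[word] = i + 2
    with open(path + ".json", "w") as out:
        json.dump(word_dict, out, indent=2)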

2. Configuring the network hyperparameters: ./config.py

The main parameters are:
  • reload: whether to reload a saved model; checkpoints are written periodically so training can resume after an unexpected interruption (e.g. a power outage)
  • dim_word: word embedding dimension
  • dim: hidden state dimension (hidden unit size)
  • n_words_src: source-language vocabulary size
  • n_words_tgt: target-language vocabulary size
  • decay_c: regularization coefficient λ
  • patience: used for early stopping (see the sketch after this list)
  • lrate: initial learning rate
  • optimizer: optimizer
  • maxlen: maximum sentence length in the training corpus
  • batch_size: batch size; on a GPU with little memory, keep this small to avoid out-of-memory errors
  • datasets: training set
  • valid_datasets: validation set
  • dictionaries: vocabulary files
  • validFreq: how often (in updates) to validate
  • dispFreq: how often to print progress information
  • saveFreq: how often to save the model
  • sampleFreq: how often to print samples from the validation data (i.e. some example source sentences, target-language outputs, and reference translations)
  • use_dropout: whether to use dropout
  • overwrite: whether intermediate model checkpoints are kept: False keeps them, True does not
  • external_validation_script: the script invoked at validation time
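To make patience concrete, here is a hedged sketch of the early-stopping logic (not Nematus's actual implementation): the validation cost is computed every validFreq updates, and training stops once it has failed to improve on the best cost for patience consecutive validations.

def should_stop(valid_costs, patience=10):
    # stop when the last `patience` validation costs are all worse
    # than the best cost observed so far
    if len(valid_costs) <= patience:
        return False
    best = min(valid_costs)
    return all(cost > best for cost in valid_costs[-patience:])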
import numpy
import os
import sys

# adjust these as needed, depending on the vocabulary sizes
VOCAB_SIZE_SRC = 40000
VOCAB_SIZE_TGT = 40000
SRCTAG = "en"
TRGTAG = "es"
DATA_DIR = "data"
TUNING_DIR = "tuning"
model = "model_en_es"

from nematus.nmt import train


if __name__ == '__main__':
    validerr = train(saveto=model+'/model.npz',
                    reload_=True,
                    dim_word=500,
                    dim=1024,
                    n_words_tgt=VOCAB_SIZE_TGT,
                    n_words_src=VOCAB_SIZE_SRC,
                    decay_c=0.,
                    clip_c=1.,
                    patience=10, #early stop patience
                    lrate=0.0001,
                    optimizer='adam', #adam,adadelta
                    maxlen=50,
                    batch_size=80,
                    valid_batch_size=80,
                    datasets=[DATA_DIR + '/train.bpe.' + SRCTAG, DATA_DIR + '/train.bpe.' + TRGTAG],
                    valid_datasets=[DATA_DIR + '/dev.bpe.' + SRCTAG, DATA_DIR + '/dev.bpe.' + TRGTAG],
                    dictionaries=[DATA_DIR + '/train.bpe.' + SRCTAG + '.json',DATA_DIR + '/train.bpe.' + TRGTAG + '.json'],
                    validFreq=10000, #10000,3000
                    dispFreq=1000,  #1000,100
                    saveFreq=30000, #30000,10000
                    #sampleFreq=10000,
                    sampleFreq=0,  # 0: produce no samples
                    use_dropout=True,
                    dropout_embedding=0.2, # dropout for input embeddings (0: no dropout)
                    dropout_hidden=0.2, # dropout for hidden layers (0: no dropout)
                    dropout_source=0.1, # dropout source words (0: no dropout)
                    dropout_target=0.1, # dropout target words (0: no dropout)
                    overwrite=False,
                    external_validation_script='./validate.sh')
    print(validerr)