Neural machine translation tool: Nematus
Code: https://github.com/czhiming/Nematus

1. Data preprocessing: ./preprocess.sh

The main steps are:

tokenization (split text into tokens)
This means that spaces have to be inserted between (e.g.) words and punctuation.
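For example (illustrative only; the exact output depends on the tokenizer version and its escaping of special characters), the Moses tokenizer invoked below behaves roughly like this:

echo "Hello, world (this is a test)." | $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l en
# Hello , world ( this is a test ) .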
#!/bin/sh

# suffix of source language files
SRC=en
SRCTAG=en
# suffix of target language files
TRG=es
TRGTAG=es
# number of merge operations. Network vocabulary should be slightly larger (to include characters),
# or smaller if the operations are learned on the joint vocabulary
#bpe_operations=89500
bpe_operations=45000
tools=tools
# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=$tools/moses
# path to subword segmentation scripts: https://github.com/rsennrich/subword-nmt
subword_nmt=$tools/subword-nmt
# tokenize
for prefix in train dev
do
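# note: normalise-romanian.py and remove-diacritics.py are taken from the WMT16
# Romanian-English preprocessing scripts; for an English source they have little effect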
cat data/$prefix.$SRCTAG | \
$mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $SRC | \
$tools/normalise-romanian.py | \
$tools/remove-diacritics.py | \
$tools/tokenizer.perl -a -l $SRC > data/$prefix.tok.$SRCTAG
cat data/$prefix.$TRGTAG | \
$mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $TRG | \
$tools/tokenizer.perl -a -l $TRG > data/$prefix.tok.$TRGTAG
done
clean (remove sentence pairs longer than 50 tokens)
Long sentences and empty sentences are removed, as they can cause problems in the training pipeline; sentence pairs that are obviously mis-aligned (for example, with an extreme source/target length ratio) are removed as well. clean-corpus-n.perl takes the minimum and maximum sentence length as its last two arguments, so the call below keeps only pairs whose sides are between 1 and 50 tokens long.
##############################################
# clean empty and long sentences, and sentences with high source-target ratio (training corpus only)
$mosesdecoder/scripts/training/clean-corpus-n.perl data/train.tok $SRCTAG $TRGTAG data/train.tok.clean 1 50 #80
truecase (lowercase sentence-initial words, except proper names)
The initial words in each sentence are converted to their most probable casing. This helps reduce data sparsity.
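The post does not show the corresponding commands; a minimal sketch using the Moses truecaser scripts (the model/ directory and the .tc output suffix are illustrative choices, not taken from the original script):

# train truecasing models on the cleaned training corpus
mkdir -p model
$mosesdecoder/scripts/recaser/train-truecaser.perl -corpus data/train.tok.clean.$SRCTAG -model model/truecase-model.$SRCTAG
$mosesdecoder/scripts/recaser/train-truecaser.perl -corpus data/train.tok.clean.$TRGTAG -model model/truecase-model.$TRGTAG

# apply truecasing to the cleaned training data and the tokenized dev data
$mosesdecoder/scripts/recaser/truecase.perl -model model/truecase-model.$SRCTAG < data/train.tok.clean.$SRCTAG > data/train.tc.$SRCTAG
$mosesdecoder/scripts/recaser/truecase.perl -model model/truecase-model.$TRGTAG < data/train.tok.clean.$TRGTAG > data/train.tc.$TRGTAG
$mosesdecoder/scripts/recaser/truecase.perl -model model/truecase-model.$SRCTAG < data/dev.tok.$SRCTAG > data/dev.tc.$SRCTAG
$mosesdecoder/scripts/recaser/truecase.perl -model model/truecase-model.$TRGTAG < data/dev.tok.$TRGTAG > data/dev.tc.$TRGTAG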
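The variables bpe_operations and subword_nmt defined at the top of the script are not used in the steps shown above. As a final preprocessing step, BPE segmentation is typically learned and applied with subword-nmt roughly as follows (a sketch; the joint code file model/$SRCTAG$TRGTAG.bpe and the .bpe output suffix are assumptions, not from the original post):

# learn a joint BPE model on the truecased training data
cat data/train.tc.$SRCTAG data/train.tc.$TRGTAG | \
$subword_nmt/learn_bpe.py -s $bpe_operations > model/$SRCTAG$TRGTAG.bpe

# apply BPE to training and dev data
for prefix in train dev
do
  $subword_nmt/apply_bpe.py -c model/$SRCTAG$TRGTAG.bpe < data/$prefix.tc.$SRCTAG > data/$prefix.bpe.$SRCTAG
  $subword_nmt/apply_bpe.py -c model/$SRCTAG$TRGTAG.bpe < data/$prefix.tc.$TRGTAG > data/$prefix.bpe.$TRGTAG
done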