Project source: https://github.com/jadore801120/attention-is-all-you-need-pytorch
1. Enter the project directory, ~/Desktop/attention-is-all-you-need-pytorch-master,
and activate the virtual environment: source activate env3
The prompt should now show (env3) ~/Desktop/attention-is-all-you-need-pytorch-master; run all of the following steps from this directory.
2. Download some useful tools
With the prompt showing (env3) ~/Desktop/attention-is-all-you-need-pytorch-master, paste:
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/tokenizer.perl
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.de
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
sed -i "s/$RealBin\/..\/share\/nonbreaking_prefixes//" tokenizer.perl
wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl
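The sed edit above is subtle, so here is a minimal, self-contained illustration (sample.perl is a hypothetical stand-in, not the real tokenizer.perl). Because the expression is double-quoted, the shell expands $RealBin to an empty string, so the pattern actually stripped is /../share/nonbreaking_prefixes; what survives is Perl's own $RealBin variable, the script's directory, which is exactly where the nonbreaking_prefix.* files were just downloaded.

```shell
# Run in a scratch directory so nothing real is touched.
cd "$(mktemp -d)"
# Stand-in for the line tokenizer.perl uses to locate its prefix files:
printf 'my $mydir = "$RealBin/../share/nonbreaking_prefixes";\n' > sample.perl
# Same command as in the walkthrough; shell-side $RealBin expands to empty:
sed -i "s/$RealBin\/..\/share\/nonbreaking_prefixes//" sample.perl
cat sample.perl   # -> my $mydir = "$RealBin";
```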
3. Download the data
With the prompt showing (env3) ~/Desktop/attention-is-all-you-need-pytorch-master, paste:
mkdir -p data/multi30k
wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz && tar -xf training.tar.gz -C data/multi30k && rm training.tar.gz
wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz && tar -xf validation.tar.gz -C data/multi30k && rm validation.tar.gz
wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/mmt16_task1_test.tar.gz && tar -xf mmt16_task1_test.tar.gz -C data/multi30k && rm mmt16_task1_test.tar.gz
The data folder now appears with the extracted files.
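The wget/tar/rm pattern used above can be rehearsed end-to-end with a throwaway local tarball standing in for the download (all names here are illustrative):

```shell
# Run in a scratch directory so nothing real is touched.
cd "$(mktemp -d)"
mkdir -p demo_src demo_data/multi30k
printf 'hello\n' > demo_src/train.en
tar -czf training.tar.gz -C demo_src train.en        # stand-in for the wget step
tar -xf training.tar.gz -C demo_data/multi30k && rm training.tar.gz
ls demo_data/multi30k                                # -> train.en
```

The -C flag makes tar extract into the target directory without cd'ing there first, and && ensures the archive is only deleted if extraction succeeded.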
4. Preprocess the data
The full dataset is too large, so trim it first: go into data/multi30k, save a spare copy of the .de and .en files, then open a terminal there and run:
wc -l train.de    # shows how many lines the file has
sed -n '1,1000p' train.de > train.de1    # saves lines 1-1000 into train.de1; trim train.en the same way
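The truncation step can be rehearsed on a throwaway five-line file (train.de here is a stand-in created in a scratch directory, not the real corpus):

```shell
# Run in a scratch directory so the real train.de is not touched.
cd "$(mktemp -d)"
printf '1\n2\n3\n4\n5\n' > train.de
wc -l train.de                       # -> 5 train.de
sed -n '1,3p' train.de > train.de1   # keep lines 1-3 (use 1,1000p on the real data)
wc -l < train.de1                    # -> 3
```

sed -n suppresses default output and p prints only the addressed range, so the redirect captures exactly those lines.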
Create the script and run it:
Create a new preprocess.sh in attention-is-all-you-need-pytorch-master and paste:
for l in en de; do for f in data/multi30k/*.$l; do if [[ "$f" != *"test"* ]]; then sed -i "$ d" "$f"; fi; done; done
for l in en de; do for f in data/multi30k/*.$l; do perl tokenizer.perl -a -no-escape -l $l -q < "$f" > "$f.atok"; done; done
python preprocess.py -train_src data/multi30k/train.en.atok -train_tgt data/multi30k/train.de.atok -valid_src data/multi30k/val.en.atok -valid_tgt data/multi30k/val.de.atok -save_data data/multi30k.atok.low.pt
bash preprocess.sh
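The first loop's sed -i "$ d" deletes each non-test file's last line (the upstream README does this, presumably to drop a trailing blank line). A throwaway demonstration:

```shell
# Run in a scratch directory so nothing real is touched.
cd "$(mktemp -d)"
printf 'a\nb\n\n' > f.txt    # three lines, the last one empty
sed -i "$ d" f.txt           # sed address $ means "last line", d deletes it
wc -l < f.txt                # -> 2
```

Note that $ followed by a space is not expanded by the shell even inside double quotes, so sed receives the script "$ d" literally.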
5. Train the model
Edit the run parameters for train.py: in Run → Edit Configurations, paste:
-data data/multi30k.atok.low.pt -save_model trained -save_mode best -proj_share_weight -label_smoothing
Run train.py.
If it reports that no CUDA GPU is available, edit train.py and set device = torch.device('cpu').
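One way to make that edit from the shell; this is a hypothetical sketch in which fake_train.py stands in for the real train.py, whose device line may be written differently, so check the actual source before patching:

```shell
# Run in a scratch directory; fake_train.py is an illustrative stand-in.
cd "$(mktemp -d)"
printf "device = torch.device('cuda')\n" > fake_train.py
# Rewrite the device selection to fall back to CPU:
sed -i "s/torch.device('cuda')/torch.device('cpu')/" fake_train.py
cat fake_train.py   # -> device = torch.device('cpu')
```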