1. Environment: CentOS 6.1 x86_64, quad-core 2.0 GHz CPU, 4 GB RAM, 30 GB swap, Linux kernel 2.6.32-131.21.1.el6.x86_64
2. Preparation
2.1 Create the directories 1srilm, 2gizapp, 3moses, 4data, 5backup, 6bin, 7model and 8tuning
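The directory layout above can be set up in one loop; a minimal sketch (the assignment1 parent path is illustrative, adjust to your own location):

```shell
# Create the numbered working directories used throughout this write-up.
# "assignment1" as the parent directory is an assumption.
for d in 1srilm 2gizapp 3moses 4data 5backup 6bin 7model 8tuning; do
    mkdir -p "assignment1/$d"
done
ls assignment1
```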
2.2 Download the software:
moses: https://nodeload.github.com/moses-smt/mosesdecoder/zipball/master
giza++: svn checkout http://giza-pp.googlecode.com/svn/trunk/ gizapp
srilm: http://www.speech.sri.com/cgi-bin/uncgi/srilm.tgz
2.3 Build 2gizapp:
Enter the gizapp directory and build inside GIZA++-v2, then copy the resulting binaries into 6bin:
[qibaoyuan@SL400 gizapp]$ cd ..
[qibaoyuan@SL400 assignment1]$ mkdir 6bin
[qibaoyuan@SL400 assignment1]$ cp gizapp/GIZA++-v2/GIZA++ 6bin/
[qibaoyuan@SL400 assignment1]$ cp gizapp/mkcls-v2/mkcls 6bin/
[qibaoyuan@SL400 assignment1]$ cp gizapp/GIZA++-v2/snt2cooc.out 6bin/
2.4 Build srilm:
Change the SRILM path in the Makefile to the location of the srilm package on your own machine: SRILM=/home/qibaoyuan/Courses/2011Autumn/Machine_Translations/assignment1/srilm
make
2.5 Build moses:
Enter the moses folder and run: ./regenerate-makefiles.sh; ./configure --with-srilm=/home/qibaoyuan/Courses/2011Autumn/Machine_Translations/assignment1/1srilm; make -j 4
Then enter the moses scripts directory and change lines 13-14 of the makefile as below; TARGETDIR receives the build output, and BINDIR holds GIZA++, mkcls and snt2cooc.out:
TARGETDIR=/home/qibaoyuan/Courses/2011Autumn/Machine_Translations/assignment1/6bin
BINDIR=/home/qibaoyuan/Courses/2011Autumn/Machine_Translations/assignment1/6bin
Because the autoconf version on my own laptop is rather old, the setting at the beginning of scripts/training/phrase-extract/extract-ghkm/configure.ac has to be changed: AC_PREREQ([2.63])
Run make release inside scripts to generate the release scripts; once the build finishes:
## Remember, only files listed in released-files are released!!
## Don't forget to set your SCRIPTS_ROOTDIR with:
export SCRIPTS_ROOTDIR=/home/qibaoyuan/Courses/2011Autumn/Machine_Translations/assignment1/6bin/scripts-20111209-1602
3. Raw data processing
- Split the original 300k-sentence bilingual corpus into train_utf8.xml.en and train_utf8.xml.ch. Verify with wc -l <file> that each side has 300k lines.
- Move the files into 4data and rename the Chinese and English sides to TrainCh.txt and TrainEn.txt.
- Word segmentation and tokenization: segment the Chinese corpus with the ICTCLAS 5.0 segmenter, called through a self-built segmentation server, producing chinese_seg.txt; tokenize the English side with the tokenizer shipped with moses, producing english_tok.txt: /home/qibaoyuan/Courses/2011Autumn/Machine_Translations/assignment1/3moses/scripts/tokenizer/tokenizer.perl -l en < TrainEn.txt > english_tok.txt. Put both result files under 4data.
- Rename the segmented Chinese file and the tokenized English file to raw.chn and raw.eng, so that moses can generate its training-corpus format from them.
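The line-count check from the first step can be scripted; a toy sketch with three-line stand-in files in place of the real 300k-line corpus:

```shell
# Stand-ins for the parallel corpus; the real files have 300k lines each.
printf 'a\nb\nc\n' > TrainEn.txt
printf 'x\ny\nz\n' > TrainCh.txt
# A parallel corpus must have the same number of lines on both sides.
en=$(wc -l < TrainEn.txt)
ch=$(wc -l < TrainCh.txt)
if [ "$en" -eq "$ch" ]; then
    echo "OK: $en sentence pairs"
else
    echo "MISMATCH: en=$en ch=$ch" >&2
fi
```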
4. Training
4.1 Enter the moses training directory and clean the data (remove overlong sentences):
[qibaoyuan@SL400 training]$ $SCRIPTS_ROOTDIR/training/clean-corpus-n.perl /home/qibaoyuan/Courses/2011Autumn/Machine_Translations/assignment1/4data/raw chn eng clean 1 100
clean-corpus.perl: processing /home/qibaoyuan/Courses/2011Autumn/Machine_Translations/assignment1/4data/raw.chn & .eng to clean, cutoff 1-100
..........(100000)..........(200000)..........(300000)
Input sentences: 300000  Output sentences: 299936
This shows that 64 of the 300k sentences were longer than 100 tokens and were removed. The step generates, under the scripts directory, the clean.chn and clean.eng files moses needs, the Chinese and English training corpora respectively.
Copy both files into the 4data folder.
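The effect of clean-corpus-n.perl can be illustrated with the length filter alone; a toy sketch (file names invented, and note the real script filters both sides of each sentence pair in lockstep):

```shell
# Toy corpus: one normal sentence and one 101-token sentence.
long=$(seq 101 | sed 's/.*/word/' | tr '\n' ' ')
printf 'short one\n%s\n' "$long" > toy.eng
# Keep only sentences with 1..100 tokens, the same cutoff used above.
awk 'NF >= 1 && NF <= 100' toy.eng > toy.clean.eng
wc -l toy.clean.eng
```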
Lowercase the English:
/home/qibaoyuan/Courses/2011Autumn/Machine_Translations/assignment1/3moses/scripts/tokenizer/lowercase.perl < /home/qibaoyuan/Courses/2011Autumn/Machine_Translations/assignment1/4data/clean.eng > /home/qibaoyuan/Courses/2011Autumn/Machine_Translations/assignment1/4data/clean.low.eng
Rename the original file to clean.ori.eng and the lowercased one to clean.eng.
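For plain ASCII English text, what lowercase.perl does amounts to a per-character case fold; tr gives an equivalent stand-in:

```shell
# Sample input file (invented) standing in for clean.eng.
echo 'This IS English Text' > sample.eng
# ASCII lowercasing, equivalent to lowercase.perl for English text.
tr '[:upper:]' '[:lower:]' < sample.eng > sample.low.eng
cat sample.low.eng   # this is english text
```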
4.2 Build the language models, using n-grams (3-gram):
[qibaoyuan@SL400 assignment1]$ 1srilm/lm/bin/i686/ngram-count -order 3 -interpolate -kndiscount -unk -text 4data/clean.chn -lm 7model/chinese.o3.lm
[qibaoyuan@SL400 assignment1]$ 1srilm/lm/bin/i686/ngram-count -order 3 -interpolate -kndiscount -unk -text 4data/clean.eng -lm 7model/english.o3.lm
Both models go into the 7model folder.
4.3 Extract the bilingual phrases:
[qibaoyuan@SL400 training]$ pwd
/home/qibaoyuan/Courses/2011Autumn/Machine_Translations/assignment1/6bin/scripts-20111209-1602/training
[qibaoyuan@SL400 training]$ ./train-model.perl --scripts-root-dir /home/qibaoyuan/Courses/2011Autumn/Machine_Translations/assignment1/6bin/scripts-20111209-1602 --root-dir /home/qibaoyuan/Courses/2011Autumn/Machine_Translations/assignment1/7model/training_e2c --corpus /home/qibaoyuan/Courses/2011Autumn/Machine_Translations/assignment1/4data/clean --f eng --e chn --max-phrase-length 10 --alignment-factors 0-0 --translation-factors 0-0 --reordering msd-fe --reordering-factors 0-0 --generation-factors 0-0 --lm 0:3:/home/qibaoyuan/Courses/2011Autumn/Machine_Translations/assignment1/7model/chinese.o3.lm
This creates, in order, the corpus, giza.eng-chn, giza.chn-eng and model folders under training_e2c; on my laptop it took nearly 5 hours.
5. Minimum error rate training:
5.1 Strip the XML markup; e2c is handled first.
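A minimal sketch of the tag stripping in 5.1, assuming NIST-style <seg> markup (the sample line is invented):

```shell
# Sample reference line in SGML style; the exact dev-set markup is assumed.
echo '<seg id="1"> this is a reference translation </seg>' > ref.sgm
# Remove all SGML/XML tags, then trim leading/trailing whitespace.
sed -e 's/<[^>]*>//g' -e 's/^ *//' -e 's/ *$//' ref.sgm > ref.clean
cat ref.clean   # this is a reference translation
```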
5.2 Segment with ICTCLAS 5.0 to produce e2c_dev_ref.sgm.clean.txt.seg.
5.3 Split the file: each source sentence has four reference translations and there are 995 source sentences in total, so split every 995 lines:
[qibaoyuan@SL400 dev_e2c]$ split -l 995 e2c_dev_ref.sgm.clean.txt.seg e2c_dev_ref.sgm.clean.txt.seg~
This yields the four files e2c_dev_ref.sgm.clean.txt.seg~0 through e2c_dev_ref.sgm.clean.txt.seg~3.
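The split in 5.3 relies on the references being stacked in blocks of 995 lines; the behaviour can be checked on a toy file (5 lines standing in for 995, and -d added for numeric suffixes):

```shell
# 4 reference sets x 5 source sentences = 20 lines, a stand-in for 4 x 995.
seq 20 > refs.txt
# GNU split with -d gives numeric suffixes: refs.txt~00, refs.txt~01, ...
split -d -l 5 refs.txt refs.txt~
wc -l refs.txt~00 refs.txt~03
```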
5.4 Move the trained model to /home/qibaoyuan/Courses/2011Autumn/Machine_Translations/assignment1/, then run the dev-set decoding and the minimum error rate training.
6. Testing
To test e2c, tokenize and lowercase the test corpus, then run:
[qibaoyuan@SL400 assignment1]$ 3moses/moses-cmd/src/moses -config 7model/training/model/moses.ini -input-file 4data/tst_e2c/e2c_tst_src.sgm.clean.tok.low 1> 4data/tst_e2c/e2c_tst_src.sgm.clean.tok.low.result 2> 4data/tst_e2c/e2c_tst_src.sgm.clean.tok.low.decode.out
c2e is tested the same way.
Test results:

                      BLEU     NIST
Chinese-to-English    0.1423   5.8607
English-to-Chinese    0.2034   6.1731
7. Summary
7.1 Takeaways
(1) I now understand the machine translation workflow and the relationship among moses, giza++ and srilm, and have a much deeper grasp of how the modules cooperate.
(2) The limitation of 32-bit machines: for addressing, even with a larger swap, a 32-bit system cannot be extended, because its maximum address range is 2^32 = 4 GB, whereas a 64-bit machine can address 2^64 bytes (16 EB).
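The 4 GB ceiling follows directly from the address width; shell arithmetic confirms it:

```shell
# A 32-bit address space covers 2^32 bytes; convert to GiB.
echo $(( (1 << 32) / 1024 / 1024 / 1024 ))   # 4
```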
7.2 Problems encountered:
(1) During tuning, extractor.sh could not be written; the fix was to use absolute paths.
(2) Building srilm failed with gnu/stubs-32.h not found; the fix was to change the machine-type under sbin to x86_64.
(3) The data is too large for the memory of a 32-bit machine, so on the 64-bit machine I set the swap larger, 30 GB on this box.