tokenizer.perl
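This is a small utility shipped with Moses, the statistical machine translation system. It tokenizes text in English, German, and other languages.
Usage:
$ perl tokenizer.perl -l en < [input file] > [tokenized output]
Here -l en says that the input file is English.
For example: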
$ perl tokenizer.perl -l en < train.en > train.tok.en
Options (from the script's own help text):
if ($HELP)
{
print "Usage ./tokenizer.perl (-l [en|de|...]) (-threads 4) < textfile > tokenizedfile\n";
print "Options:\n";
print " -q ... quiet.\n";
print " -a ... aggressive hyphen splitting.\n";
print " -b ... disable Perl buffering.\n";
print " -time ... enable processing time calculation.\n";
print " -penn ... use Penn treebank-like tokenization.\n";
print " -protected FILE ... specify file with patters to be protected in tokenisation.\n";
print " -no-escape ... don't perform HTML escaping on apostrophy, quotes, etc.\n";
exit;
}
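Notes:
With no options, the input is treated as English. Punctuation is split off from words, and quotes are escaped as HTML entities (an apostrophe becomes &apos;). Hyphenated words are not split.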
$ echo "A Republican 'strategy' to counter the re-election of Obama." | perl ~/script/mosesdecoder/scripts/tokenizer/tokenizer.perl
>>> A Republican &apos; strategy &apos; to counter the re-election of Obama .
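-l: sets the language. I don't know the full list of supported languages; I have only used English and German. As far as I can tell, the script looks for a per-language nonbreaking-prefix file under scripts/share/nonbreaking_prefixes/ and falls back to English when there is none for the given code.
-a: splits hyphenated words as well (the hyphen becomes @-@), in addition to splitting off punctuation. For example: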
$ echo "A Republican strategy to counter the re-election of Obama." | perl ~/script/mosesdecoder/scripts/tokenizer/tokenizer.perl -a
>>> A Republican strategy to counter the re @-@ election of Obama .
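To make the -a behaviour concrete, here is a rough sed approximation of the hyphen rule and its inverse. It is only a sketch: the function names split_hyphens and join_hyphens are made up for illustration, and [[:alnum:]] follows the current locale, so it may not cover every Unicode letter the way the real script does.

```shell
# Approximate what -a does to hyphens: a hyphen between two
# alphanumeric characters becomes " @-@ ". The loop (:a ... ta)
# handles chains such as state-of-the-art.
split_hyphens() {
  sed -E -e ':a' -e 's/([[:alnum:]])-([[:alnum:]])/\1 @-@ \2/' -e 'ta'
}

# Undo the split, e.g. after translation, by gluing " @-@ " back into "-".
join_hyphens() {
  sed -e 's/ @-@ /-/g'
}

echo "re-election" | split_hyphens        # → re @-@ election
echo "re @-@ election" | join_hyphens     # → re-election
```

The inverse is handy when the translated output still contains @-@ markers and should be turned back into ordinary hyphenated words.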
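-no-escape: splits off punctuation only. Hyphenated words are still left intact, and quotes and other special characters are not escaped: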
$ echo "A Republican 'strategy' to counter the re-election of Obama." | perl ~/script/mosesdecoder/scripts/tokenizer/tokenizer.perl -no-escape
>>>A Republican ' strategy ' to counter the re-election of Obama .
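If a file was already tokenized without -no-escape, the entities can be turned back into plain characters afterwards. Moses ships scripts/tokenizer/deescape-special-chars.perl for this; the sed function below is only a minimal sketch of the same idea. The name deescape is made up, and the entity list is the set I believe the tokenizer emits by default, so treat it as an assumption.

```shell
# Reverse the tokenizer's default HTML-style escaping.
# &amp; is handled last so that a literal "&amp;lt;" becomes "&lt;"
# and is not unescaped a second time into "<".
deescape() {
  sed -e 's/&#124;/|/g' \
      -e 's/&lt;/</g' \
      -e 's/&gt;/>/g' \
      -e "s/&apos;/'/g" \
      -e 's/&quot;/"/g' \
      -e 's/&#91;/[/g' \
      -e 's/&#93;/]/g' \
      -e 's/&amp;/\&/g'
}

echo "A Republican &apos; strategy &apos; to counter the re-election of Obama ." | deescape
# → A Republican ' strategy ' to counter the re-election of Obama .
```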
That is all for now; I will add more as new issues come up.
P.S.
# detokenizer
cat train.en | perl ~/script/mosesdecoder/scripts/tokenizer/detokenizer.perl -threads 40 > train.raw.en
# tokenizer
perl ~/script/mosesdecoder/scripts/tokenizer/tokenizer.perl -l en -no-escape < train.raw.en > train.tok.en
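The two steps above can be wrapped in one function so that no intermediate file is needed. This is just a sketch: retokenize and MOSES_TOK_DIR are hypothetical names, and the default path simply mirrors the one used in this note, so adjust it to your own checkout.

```shell
# Assumed location of the Moses tokenizer scripts (override via environment).
MOSES_TOK_DIR="${MOSES_TOK_DIR:-$HOME/script/mosesdecoder/scripts/tokenizer}"

# retokenize LANG SRC DST
# Detokenize SRC, then tokenize it again without HTML escaping into DST.
retokenize() {
  lang=$1
  src=$2
  dst=$3
  # Fail early with a clear message if either script is missing.
  for s in detokenizer.perl tokenizer.perl; do
    if [ ! -f "$MOSES_TOK_DIR/$s" ]; then
      echo "missing $MOSES_TOK_DIR/$s" >&2
      return 1
    fi
  done
  perl "$MOSES_TOK_DIR/detokenizer.perl" -l "$lang" < "$src" |
    perl "$MOSES_TOK_DIR/tokenizer.perl" -l "$lang" -no-escape > "$dst"
}
```

Usage would look like: retokenize en train.en train.tok.en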