#Chapter 2: Sentence Detector#
##Sentence Detection##
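The Apache OpenNLP Sentence Detector decides whether a punctuation character marks the end of a sentence. In this sense, a sentence is defined as the longest whitespace-trimmed character sequence between two punctuation marks. The first and last sentences are exceptions to this rule: the first non-whitespace character is assumed to begin a sentence, and the last non-whitespace character is assumed to end one. The sample text below should be segmented into its sentences: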
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is
chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years
old and former chairman of Consolidated Gold Fields PLC, was named a director of this
British industrial conglomerate.
After the sentence boundaries are detected, each sentence is written on its own line:
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
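
To see why a punctuation mark alone does not settle where a sentence ends, consider a deliberately naive baseline that splits after every `.`, `!` or `?` followed by whitespace. This is only an illustrative sketch using the JDK; the class `NaiveSentenceSplitter` and its `split` method are hypothetical names and are not part of OpenNLP, and this is not how the Sentence Detector works.

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveSentenceSplitter {

    /** Splits after every '.', '!' or '?' that is followed by whitespace or end of input. */
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean isPunctuation = c == '.' || c == '!' || c == '?';
            boolean atBoundary = i + 1 == text.length()
                    || Character.isWhitespace(text.charAt(i + 1));
            if (isPunctuation && atBoundary) {
                // Whitespace-trim the candidate, as in the sentence definition above.
                String candidate = text.substring(start, i + 1).trim();
                if (!candidate.isEmpty()) {
                    sentences.add(candidate);
                }
                start = i + 1;
            }
        }
        // Trailing text without closing punctuation still counts as a sentence.
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.";
        for (String s : split(text)) {
            System.out.println(s);
        }
    }
}
```

On this input the baseline wrongly treats the period in "Mr." as a sentence end and returns two pieces. Avoiding that kind of mistake is what the trained model is for.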
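Sentence detection is usually done before the text is tokenized, and the pre-trained models on the website were trained that way. It is also possible to tokenize first and let the Sentence Detector process the already tokenized text. The OpenNLP Sentence Detector cannot identify sentence boundaries based on the content of the sentence. A prominent example is the first sentence of an article, where the title is mistakenly identified as the first part of the first sentence. Most components in OpenNLP expect input that has already been segmented into sentences.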
###Sentence Detection Tool###
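The easiest way to try out the Sentence Detector is the command line tool. The tool is intended only for demonstration and testing. Download the English sentence detector model and start the Sentence Detector with the following command: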
$ opennlp SentenceDetector en-sent.bin
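Copy the sample text from above into the console. The Sentence Detector reads it and writes one sentence per line to the console. Usually the input is read from a file and the output is redirected to another file, which can be done with the following command: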
$ opennlp SentenceDetector en-sent.bin < input.txt > output.txt
For the English sentence model from the website, the input text should not be tokenized.
###Sentence Detection API###
The Sentence Detector can easily be integrated into an application through its API. Before the Sentence Detector can be instantiated, the sentence model must be loaded:
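
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import opennlp.tools.sentdetect.SentenceModel;

    // Declared outside the try block so it stays in scope for the detector below.
    SentenceModel model = null;

    // try-with-resources closes the stream even if loading the model fails.
    try (InputStream modelIn = new FileInputStream("en-sent.bin")) {
        model = new SentenceModel(modelIn);
    }
    catch (IOException e) {
        e.printStackTrace();
    }
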
After the model is loaded, the SentenceDetectorME can be instantiated:
SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
The Sentence Detector can output an array of Strings, where each String is one sentence:
String sentences[] = sentenceDetector.sentDetect("  First sentence. Second sentence. ");
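The result array contains two entries. The first String is "First sentence." and the second String is "Second sentence.". The whitespace before, between and after the sentences in the input String is removed. The API also offers a method that simply returns the spans of the sentences in the input String: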
Span sentences[] = sentenceDetector.sentPosDetect("  First sentence. Second sentence. ");
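The result array again contains two entries. The first span begins at index 2 and ends at 17. The second span begins at 18 and ends at 34. The method Span.getCoveredText can be used to create a substring that contains only the characters covered by the span.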
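
The span indices can be checked with plain string operations. The sketch below uses only the JDK and assumes the input has two leading spaces and that the end index is exclusive, as with `String.substring`. The class `SpanTextDemo` and its `coveredText` helper are hypothetical stand-ins for illustration, not OpenNLP classes.

```java
public class SpanTextDemo {

    /** Returns the characters in [start, end), i.e. the end index is exclusive. */
    static String coveredText(String text, int start, int end) {
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // Two leading spaces and one trailing space, so "First" starts at index 2.
        String input = "  First sentence. Second sentence. ";
        System.out.println("[" + coveredText(input, 2, 17) + "]");   // prints "[First sentence.]"
        System.out.println("[" + coveredText(input, 18, 34) + "]");  // prints "[Second sentence.]"
    }
}
```

Working with spans instead of extracted Strings keeps the character offsets into the original text, which is useful when later annotations must refer back to it.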
##Sentence Detector Training##
###Training Tool###
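OpenNLP has a command line tool that is used to train the models available from the model download page on various corpora. The data must be converted to the OpenNLP Sentence Detector training format, which is one sentence per line, exactly like the output in the sample above. An empty line indicates a document boundary. If the document boundaries are unknown, it is recommended to insert an empty line every few dozen sentences. Usage of the tool: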
$ opennlp SentenceDetectorTrainer
Usage: opennlp SentenceDetectorTrainer[.namefinder|.conllx|.pos] [-abbDict path] \
[-params paramsFile] [-iterations num] [-cutoff num] -model modelFile \
-lang language -data sampleData [-encoding charsetName]
Arguments description:
-abbDict path
abbreviation dictionary in XML format.
-params paramsFile
training parameters file.
-iterations num
number of training iterations, ignored if -params is used.
-cutoff num
minimal number of times a feature must be seen, ignored if -params is used.
-model modelFile
output model file.
-lang language
language which is being processed.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
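
The training format described above (one sentence per line, an empty line between documents) can be produced with a few lines of code. The following is a minimal sketch using only the JDK; the class `TrainingDataFormat` and its `format` method are hypothetical names, and the sample sentences are placeholders rather than real training data.

```java
import java.util.Arrays;
import java.util.List;

public class TrainingDataFormat {

    /** Joins documents into training text: one sentence per line, blank line between documents. */
    static String format(List<List<String>> documents) {
        StringBuilder out = new StringBuilder();
        for (int d = 0; d < documents.size(); d++) {
            if (d > 0) {
                out.append('\n'); // an empty line marks a document boundary
            }
            for (String sentence : documents.get(d)) {
                out.append(sentence.trim()).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("First sentence.", "Second sentence."),
                Arrays.asList("Another document starts here."));
        // The result could be written to a file and passed to the trainer via -data.
        System.out.print(format(docs));
    }
}
```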
To train an English sentence detector, use the following command:
$ opennlp SentenceDetectorTrainer -model en-sent.bin -lang en -data en-sent.train -encoding UTF-8
It should produce the following output:
Indexing events using cutoff of 5
Computing event counts... done. 4883 events
Indexing... done.
Sorting and merging events... done. Reduced 4883 events to 2945.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 2945
Number of Outcomes: 2
Number of Predicates: 467
...done.
Computing model parameters...
Performing 100 iterations.
1: .. loglikelihood=-3384.6376826743144 0.38951464263772273
2: .. loglikelihood=-2191.9266688597672 0.9397911120212984
3: .. loglikelihood=-1645.8640771555981 0.9643661683391358
4: .. loglikelihood=-1340.386303774519 0.9739913987302887
5: .. loglikelihood=-1148.4141548519624 0.9748105672742167
...<skipping a bunch of iterations>...
95: .. loglikelihood=-288.25556805874436 0.9834118369854598
96: .. loglikelihood=-287.2283680343481 0.9834118369854598
97: .. loglikelihood=-286.2174830344526 0.9834118369854598
98: .. loglikelihood=-285.222486981048 0.9834118369854598
99: .. loglikelihood=-284.24296917223916 0.9834118369854598
100: .. loglikelihood=-283.2785335773966 0.9834118369854598
Wrote sentence detector model.
Path: en-sent.bin
###Training API###
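The Sentence Detector also offers an API to train a new sentence detection model. Training takes three basic steps:
- The application must open a sample data stream
- Call the SentenceDetectorME.train method
- Save the SentenceModel to a file or use it directly
The following sample code illustrates these steps: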

    Charset charset = Charset.forName("UTF-8");

    // Step 1: open the sample data stream (one sentence per line).
    ObjectStream<String> lineStream =
        new PlainTextByLineStream(new FileInputStream("en-sent.train"), charset);
    ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);

    // Step 2: train the model; the stream is closed even if training fails.
    SentenceModel model;
    try {
        model = SentenceDetectorME.train("en", sampleStream, true, null, TrainingParameters.defaultParams());
    }
    finally {
        sampleStream.close();
    }

    // Step 3: save the model. modelFile is not defined in this snippet and is
    // assumed to be supplied by the surrounding application.
    OutputStream modelOut = null;
    try {
        modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
        model.serialize(modelOut);
    } finally {
        if (modelOut != null)
            modelOut.close();
    }

##Evaluation##
###Evaluation Tool###
The following command shows how to run the evaluator tool:
$ opennlp SentenceDetectorEvaluator -model en-sent.bin -lang en -data en-sent.eval -encoding UTF-8
Loading model ... done
Evaluating ... done
Precision: 0.9465737514518002
Recall: 0.9095982142857143
F-Measure: 0.9277177006260672
The en-sent.eval file has the same format as the training data.
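
The F-Measure reported above is consistent with the harmonic mean of precision and recall, F = 2PR / (P + R). The sketch below recomputes it from the two values in the sample output using only the JDK. The class `FMeasureDemo` is a hypothetical name, and returning 0 when both inputs are 0 is a choice made for this sketch, not a statement about what the evaluator does.

```java
public class FMeasureDemo {

    /** Harmonic mean of precision and recall; returns 0 when both are 0. */
    static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Values copied from the evaluator output shown above.
        double precision = 0.9465737514518002;
        double recall = 0.9095982142857143;
        System.out.println(fMeasure(precision, recall));
    }
}
```

The result should agree with the reported 0.9277 up to floating-point rounding.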