A Summary of Using OpenNLP

robinliu2010

http://danielmclaren.com/2007/05/11/getting-started-with-opennlp-natural-language-processing

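I have only just started working with NLP, and recently I tried out the open-source toolkit OpenNLP. It includes a sentence detector, a part-of-speech (POS) tagger, and a treebank parser. This post summarizes the experience and tips I have gathered from using OpenNLP so far.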

What can OpenNLP do?

Let's use the following short passage as an example to see what OpenNLP can actually do: This isn't the greatest example sentence in the world because I've seen better. Neither is this one. This one's not bad, though.

Sentence Detector
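Put simply, the sentence detector splits text into sentences. This is harder than it sounds, because not every sentence ends with a period, and conversational text can be more complicated still. Fortunately, OpenNLP provides a module that does this for us. Sentence detection comes before every other operation, because the other tools process only one sentence at a time.
The sentence detector returns a String array. For our example, the first element of that array is: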
This isn't the greatest example sentence in the world because I've seen better.
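To make the input and output shape of this step concrete, here is a minimal sketch that splits the example passage into a String array. Note that it does not use OpenNLP: as a stand-in it uses the JDK's rule-based java.text.BreakIterator, whereas OpenNLP's detector is model-based. The class and method names (SentenceSplitSketch, splitSentences) are made up for this illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: JDK BreakIterator as a stand-in for a sentence detector.
class SentenceSplitSketch {

    // Returns the sentences found in the text, trimmed, in order.
    static String[] splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String text = "This isn't the greatest example sentence in the world "
                + "because I've seen better. Neither is this one. "
                + "This one's not bad, though.";
        for (String sentence : splitSentences(text)) {
            System.out.println(sentence);
        }
    }
}
```

A rule-based splitter like this handles the simple passage above, but it is exactly the harder cases mentioned earlier (abbreviations, dialogue, missing periods) where a trained detector is expected to do better.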
Tokenizer
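Both the POS tagger and the treebank parser need the sentence to be broken into tokens first. Usually one word is one token, but some words have to be split into two. For example, "don't" becomes the two tokens "do" and "n't". Here is the tokenized form of our sentence: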
This is n't the greatest example sentence in the world because I 've seen better .
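The splitting convention can be sketched with a few regular expressions. This is a rough approximation written for this post, not OpenNLP's tokenizer: the class name TokenizerSketch and the particular list of clitics ('s, 've, 're, 'll, 'd, 'm) are my own assumptions, and real text has many cases these rules do not cover.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a regex approximation of treebank-style tokenization.
class TokenizerSketch {

    static List<String> tokenize(String sentence) {
        String spaced = sentence
                // Put spaces around common punctuation marks.
                .replaceAll("([.,!?;:])", " $1 ")
                // Split the negation clitic: "isn't" becomes "is n't".
                .replaceAll("(?i)n't\\b", " n't")
                // Split other clitics: "I've" becomes "I 've".
                .replaceAll("(?i)'(s|ve|re|ll|d|m)\\b", " '$1");
        List<String> tokens = new ArrayList<>();
        for (String token : spaced.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "This isn't the greatest example sentence in the world "
                + "because I've seen better.";
        System.out.println(String.join(" ", tokenize(sentence)));
    }
}
```

For the example sentence these rules give the same token sequence as shown above; the point is only to show why "isn't" and "I've" each turn into two tokens.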
POS Tagger
The POS tagger assigns a part-of-speech tag (for example verb, adverb, or personal pronoun) to each token. Here is the tagging result:
This/DT is/VBZ n't/RB the/DT greatest/JJS example/NN sentence/NN in/IN the/DT world/NN because/IN I/PRP 've/VBP seen/VBN better/RB ./.
You can refer to this article to understand POS tags.
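The token/TAG notation is easier to read once the tag codes are spelled out. The sketch below parses the tagged line from this section and prints a description for each tag. The descriptions follow the Penn Treebank tag set as I understand it, and the table covers only the tags that occur in this one sentence; PosTagSketch, splitTagged, and describe are names invented for the illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: reading "token/TAG" pairs and describing the tags.
class PosTagSketch {

    // Only the tags that appear in the example sentence.
    static final Map<String, String> TAG_NAMES = new LinkedHashMap<>();
    static {
        TAG_NAMES.put("DT", "determiner");
        TAG_NAMES.put("VBZ", "verb, 3rd person singular present");
        TAG_NAMES.put("RB", "adverb");
        TAG_NAMES.put("JJS", "adjective, superlative");
        TAG_NAMES.put("NN", "noun, singular or mass");
        TAG_NAMES.put("IN", "preposition or subordinating conjunction");
        TAG_NAMES.put("PRP", "personal pronoun");
        TAG_NAMES.put("VBP", "verb, non-3rd person singular present");
        TAG_NAMES.put("VBN", "verb, past participle");
        TAG_NAMES.put(".", "sentence-final punctuation");
    }

    // Splits "token/TAG" at the last slash and returns {token, tag}.
    static String[] splitTagged(String pair) {
        int slash = pair.lastIndexOf('/');
        if (slash < 0) {
            throw new IllegalArgumentException("Not a token/TAG pair: " + pair);
        }
        return new String[] { pair.substring(0, slash), pair.substring(slash + 1) };
    }

    static String describe(String tag) {
        return TAG_NAMES.getOrDefault(tag, "unknown tag");
    }

    public static void main(String[] args) {
        String tagged = "This/DT is/VBZ n't/RB the/DT greatest/JJS example/NN "
                + "sentence/NN in/IN the/DT world/NN because/IN I/PRP 've/VBP "
                + "seen/VBN better/RB ./.";
        for (String pair : tagged.split(" ")) {
            String[] parts = splitTagged(pair);
            System.out.println(parts[0] + "\t" + parts[1] + "\t" + describe(parts[1]));
        }
    }
}
```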
Treebank Chunker
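The chunker groups the tagged tokens of a sentence into chunks, so that noun phrases and verb phrases are marked correctly. For our example, we get the following chunks: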
[NP This/DT ] [VP is/VBZ ] n't/RB [NP the/DT greatest/JJS example/NN sentence/NN ] [PP in/IN ] [NP the/DT world/NN ] [SBAR because/IN ] [NP I/PRP ] [VP 've/VBP seen/VBN ] [ADVP better/RB ] ./.
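Chunkers commonly report their result as one tag per token in B-/I-/O form (B-NP starts a noun phrase, I-NP continues it, O means outside any chunk). The sketch below shows how such tags map onto the bracketed notation used here. The chunk tags in main are written by hand to match the example, not produced by running OpenNLP, and ChunkBracketSketch and bracket are names made up for this illustration.

```java
// Illustration only: rendering per-token B-/I-/O chunk tags as bracketed chunks.
class ChunkBracketSketch {

    static String bracket(String[] tokens, String[] posTags, String[] chunkTags) {
        StringBuilder out = new StringBuilder();
        boolean open = false; // true while we are inside a bracketed chunk
        for (int i = 0; i < tokens.length; i++) {
            String chunk = chunkTags[i];
            boolean continues = open && chunk.startsWith("I-");
            if (open && !continues) {
                out.append("] "); // the previous chunk ends here
                open = false;
            }
            if (chunk.startsWith("B-")) {
                out.append('[').append(chunk.substring(2)).append(' ');
                open = true;
            }
            // A stray I- tag with no open chunk is simply treated as outside.
            out.append(tokens[i]).append('/').append(posTags[i]).append(' ');
        }
        if (open) {
            out.append(']');
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        String[] tokens = { "This", "is", "n't", "the", "greatest", "example", "sentence",
                "in", "the", "world", "because", "I", "'ve", "seen", "better", "." };
        String[] posTags = { "DT", "VBZ", "RB", "DT", "JJS", "NN", "NN",
                "IN", "DT", "NN", "IN", "PRP", "VBP", "VBN", "RB", "." };
        // Hand-written chunk tags, chosen to match the example in this post.
        String[] chunkTags = { "B-NP", "B-VP", "O", "B-NP", "I-NP", "I-NP", "I-NP",
                "B-PP", "B-NP", "I-NP", "B-SBAR", "B-NP", "B-VP", "I-VP", "B-ADVP", "O" };
        System.out.println(bracket(tokens, posTags, chunkTags));
    }
}
```

With these hand-written tags, the printed line is the same bracketed string as the chunker output shown above.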
Treebank Parser
The treebank parser builds the syntactic structure tree (parse tree) of a sentence.
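Unlike the flat chunks, a parse tree is nested: phrases contain other phrases. As a small sketch of that data structure, the code below builds a tree for the fragment "in the world" by hand and prints it in the usual bracketed treebank style. The tree is constructed manually for illustration, not produced by OpenNLP's parser, and ParseTreeSketch, Node, and exampleTree are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: a minimal tree structure with bracketed rendering.
class ParseTreeSketch {

    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label, Node... kids) {
            this.label = label;
            for (Node kid : kids) {
                children.add(kid);
            }
        }

        // Leaves print as the word itself; inner nodes as "(LABEL child child ...)".
        String toBracketString() {
            if (children.isEmpty()) {
                return label;
            }
            StringBuilder sb = new StringBuilder("(").append(label);
            for (Node child : children) {
                sb.append(' ').append(child.toBracketString());
            }
            return sb.append(')').toString();
        }
    }

    // Hand-built tree: a prepositional phrase containing a noun phrase.
    static Node exampleTree() {
        return new Node("PP",
                new Node("IN", new Node("in")),
                new Node("NP",
                        new Node("DT", new Node("the")),
                        new Node("NN", new Node("world"))));
    }

    public static void main(String[] args) {
        System.out.println(exampleTree().toBracketString());
    }
}
```

The nesting is the difference from chunking: here the NP "the world" sits inside the PP, while the chunker reports [PP in] and [NP the world] side by side.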