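Information Extraction
The figure below shows the architecture of a simple information extraction system.
First, a sentence segmenter splits the raw text of the document into sentences, and a tokenizer further subdivides each sentence into words. Next, each sentence is tagged with part-of-speech tags. In the following step, named entity recognition, we look for the entities mentioned in each sentence. Finally, relation recognition searches for likely relations between the different entities in the text.
For the first three steps, we can define a function: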
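>>> import nltk
>>> def ie_preprocess(document):
...     sentences = nltk.sent_tokenize(document)                      # sentence segmentation
...     sentences = [nltk.word_tokenize(sent) for sent in sentences]  # tokenization
...     sentences = [nltk.pos_tag(sent) for sent in sentences]        # part-of-speech tagging
...     return sentences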
Chunking
Chunking with Regular Expressions
A regular-expression chunk grammar is written in the following format:
r"""
ChunkLabel: {<expression>…<>}
            {…}
"""
A simple noun phrase chunker:
>>> import nltk
>>> grammar = r"""
... NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
...     {<NNP>+}                # chunk sequences of proper nouns
... """
>>> cp = nltk.RegexpParser(grammar)
>>> sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
...             ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
>>> print(cp.parse(sentence))
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))
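The chunking rules go inside the curly braces, and a grammar may contain one or more of them. When there is more than one rule, RegexpParser applies them in order, updating the chunk structure after each one, until every rule has been applied.
nltk.RegexpParser(grammar) builds a chunk parser from the chunking rules, and cp.parse() runs that parser on the target sentence. The result is a tree, which we can display with print.
Here is another example: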
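Because the rules are applied one after another, their order can change the result. The following is a minimal sketch, assuming a made-up three-word sentence and two illustrative grammars (`narrow_first`, `wide_first`, and the helper `np_chunks` are hypothetical names, not from the original). Both grammars contain the same two rules in opposite order; a later rule only chunks tokens that earlier rules left unchunked.

```python
import nltk

# Hand-tagged toy sentence (illustrative, not from the original text).
sentence = [("the", "DT"), ("dog", "NN"), ("barked", "VBD")]

# The same two rules in opposite order.
narrow_first = nltk.RegexpParser(r"""
NP: {<NN>}        # rule 1 chunks the bare noun
    {<DT><NN>}    # rule 2 finds nothing because the noun is already chunked
""")
wide_first = nltk.RegexpParser(r"""
NP: {<DT><NN>}    # rule 1 chunks determiner plus noun
    {<NN>}        # rule 2 finds no unchunked noun left
""")

def np_chunks(parser, tagged_sentence):
    """Return the words of every NP chunk the parser finds."""
    tree = parser.parse(tagged_sentence)
    return [[word for word, tag in subtree.leaves()]
            for subtree in tree.subtrees()
            if subtree.label() == "NP"]

print(np_chunks(narrow_first, sentence))  # [['dog']]
print(np_chunks(wide_first, sentence))    # [['the', 'dog']]
```

Putting the more specific, longer pattern first is therefore usually the safer choice when two rules can compete for the same tokens.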
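Besides printing the tree, we can walk it in code. The sketch below reuses the noun phrase grammar and the Rapunzel sentence from above and collects the text of each NP chunk; the variable name `noun_phrases` is only illustrative.

```python
import nltk

grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN>}   # determiner/possessive, adjectives and noun
    {<NNP>+}                # sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
tree = cp.parse(sentence)

# The root node is labelled "S"; each chunk is a subtree labelled "NP".
print(tree.label())

noun_phrases = []
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    # leaves() returns the (word, tag) pairs covered by this chunk.
    noun_phrases.append(" ".join(word for word, tag in subtree.leaves()))

print(noun_phrases)  # ['Rapunzel', 'her long golden hair']
```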
>>> cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
>>> brown = nltk.corpus.brown
>>> for sent in brown.tagged_sents():
...     tree = cp.parse(sent)
...     for subtree in tree.subtrees():
...         if subtree.label() == 'CHUNK': print(subtree)
...
(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)
(CHUNK expected/VBN to/TO become/VB)
...
(CHUNK seems/VBZ to/TO overtake/VB)
(CHUNK want/VB to/TO buy/VB)
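The same pattern can be tried without the Brown corpus. The sketch below applies it to a single hand-tagged sentence, which is made up for illustration; `<V.*>` matches any tag that starts with V, so VBP and VB both qualify.

```python
import nltk

# Verb, then "to", then verb: the same pattern as in the Brown example.
cp = nltk.RegexpParser("CHUNK: {<V.*> <TO> <V.*>}")

# Hand-tagged sentence (illustrative, not taken from a corpus).
sentence = [("I", "PRP"), ("want", "VBP"), ("to", "TO"),
            ("buy", "VB"), ("a", "DT"), ("car", "NN")]

tree = cp.parse(sentence)
matches = [subtree.leaves()
           for subtree in tree.subtrees()
           if subtree.label() == "CHUNK"]

print(matches)  # [[('want', 'VBP'), ('to', 'TO'), ('buy', 'VB')]]
```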
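Chinking
Sometimes it is easier to define what we want to exclude from a chunk. A chink is a sequence of tokens that is not included in any chunk. A chink pattern is written as '}expression{'. In the following example, barked/VBD at/IN is a chink: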
[ the/DT little/JJ yellow/JJ dog/NN ] barked/VBD at/IN [ the/DT cat/NN ]
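Chinking is the process of removing a sequence of tokens from a chunk. There are three cases:
1. If the matching sequence of tokens spans an entire chunk, the whole chunk is removed.
2. If the sequence appears in the middle of a chunk, those tokens are removed, leaving two chunks where there was one before.
3. If the sequence is at the edge of a chunk (its start or its end), those tokens are removed, leaving a smaller chunk.
The following table illustrates these three cases: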
| | Entire chunk | Middle of a chunk | End of a chunk |
|---|---|---|---|
| Input | [a/DT little/JJ dog/NN] | [a/DT little/JJ dog/NN] | [a/DT little/JJ dog/NN] |
| Operation | Chink "DT JJ NN" | Chink "JJ" | Chink "NN" |
| Pattern | }DT JJ NN{ | }JJ{ | }NN{ |
| Output | a/DT little/JJ dog/NN | [a/DT] little/JJ [dog/NN] | [a/DT little/JJ] dog/NN |
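The three cases can be reproduced with RegexpParser. The sketch below is only an illustration: the helper `chunk_then_chink` is a hypothetical name, and it first chunks the whole sentence with `{<.*>+}` and then applies a single chink pattern, returning the words of whatever NP chunks remain.

```python
import nltk

sentence = [("a", "DT"), ("little", "JJ"), ("dog", "NN")]

def chunk_then_chink(chink_pattern):
    """Chunk everything, apply one chink pattern, return the remaining NP chunks."""
    grammar = "NP:\n  {<.*>+}\n  }" + chink_pattern + "{"
    tree = nltk.RegexpParser(grammar).parse(sentence)
    return [[word for word, tag in subtree.leaves()]
            for subtree in tree.subtrees()
            if subtree.label() == "NP"]

entire = chunk_then_chink("<DT><JJ><NN>")  # chink spans the entire chunk
middle = chunk_then_chink("<JJ>")          # chink in the middle of the chunk
end = chunk_then_chink("<NN>")             # chink at the end of the chunk

print(entire)  # []
print(middle)  # [['a'], ['dog']]
print(end)     # [['a', 'little']]
```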
Example:
>>> grammar = r"""
... NP:
...     {<.*>+}          # Chunk everything
...     }<VBD|IN>+{      # Chink sequences of VBD and IN
... """
>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
...             ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> cp = nltk.RegexpParser(grammar)
>>> print(cp.parse(sentence))
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))