NLTK Study Notes, Part 4: Text Information Extraction

zzulp

Article link: https://blog.csdn.net/zzulp/article/details/77414113

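1 Information Extraction

Extracting information from a database is straightforward; extracting it from natural-language text is much less so. A typical information extraction pipeline is shown in the figure below:
[Figure: information extraction pipeline]
The pipeline starts with sentence segmentation and tokenization. Next comes part-of-speech (POS) tagging, then named entity recognition, and finally relation detection, which searches for likely relations between entities that appear close together in the text.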
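To make the order of the stages concrete, here is a minimal sketch of how they chain together. Every stage is a deliberately naive stand-in written for this illustration (a regex sentence splitter, a regex tokenizer, a tiny hand-written lexicon called TOY_LEXICON, and "consecutive NNP tokens form an entity"); none of it is NLTK's implementation, and all function names are hypothetical.

```python
import re

# Hypothetical toy lexicon; a real system would use a trained POS tagger.
TOY_LEXICON = {"Mary": "NNP", "works": "VBZ", "at": "IN", "Acme": "NNP", ".": "."}

def split_sentences(text):
    # Stage 1: sentence segmentation (split after ., ! or ?)
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    # Stage 2: tokenization (words, and punctuation as separate tokens)
    return re.findall(r"\w+|[^\w\s]", sentence)

def pos_tag(tokens):
    # Stage 3: POS tagging (lexicon lookup, unknown words default to NN)
    return [(tok, TOY_LEXICON.get(tok, "NN")) for tok in tokens]

def find_entities(tagged):
    # Stage 4: entity detection (each run of NNP tokens is one entity span)
    spans, start = [], None
    for i, (_, tag) in enumerate(tagged + [("", "END")]):
        if tag == "NNP" and start is None:
            start = i
        elif tag != "NNP" and start is not None:
            spans.append((start, i))
            start = None
    return spans

def find_relations(tagged, spans):
    # Stage 5: relation detection (text between two neighbouring entities)
    words = [w for w, _ in tagged]
    return [(" ".join(words[s1:e1]), " ".join(words[e1:s2]), " ".join(words[s2:e2]))
            for (s1, e1), (s2, e2) in zip(spans, spans[1:])]

def extract(text):
    results = []
    for sent in split_sentences(text):
        tagged = pos_tag(tokenize(sent))
        results.extend(find_relations(tagged, find_entities(tagged)))
    return results

print(extract("Mary works at Acme. It rains."))
# [('Mary', 'works at', 'Acme')]
```

The point of the sketch is only the data flow: each stage consumes the output of the previous one, which is why tagging errors propagate into entity and relation detection.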

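2 Chunking

Chunking is the basic technique used for named entity recognition (NER), and POS tags are the most important information a chunker relies on. This section uses noun phrases (NP) to show how chunking works; verb phrases, prepositional phrases and other phrase types can be chunked in the same way. The figure below illustrates NP chunking.
[Figure: NP chunking example]
A chunker can be built from hand-written rules, using regular expressions over POS tags, or with a statistical classifier. This section first introduces the regular-expression chunker that NLTK provides.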

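2.1 Regular-Expression Chunking

NLTK provides RegexpParser, a regular-expression parser that works on POS tags and groups tokens whose tags match a pattern into chunks. Each rule is a sequence of tag patterns; a tag pattern is written in angle brackets and matches one token with that tag. For example, <NN> matches a noun. Because nouns have finer-grained tags such as NNP and NNS, <NN.*> can be used to match any kind of noun. The code below chunks the determiner-adjective-noun phrases from the figure above.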

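import nltk

# A POS-tagged sentence: a list of (word, tag) pairs
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# NP chunk: any determiners, then any adjectives, then one or more nouns
grammar = 'NP: {<DT>*<JJ>*<NN>+}'
cp = nltk.RegexpParser(grammar)
tree = cp.parse(sentence)

print(tree)
tree.draw()  # opens a window showing the tree (needs a GUI / Tkinter)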

[Figure: the resulting chunk tree]
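To see why <NN.*> catches NNS and NNP while <NN> does not, it can help to imitate tag-pattern matching with the standard re module. The sketch below is a simplified, pure-Python imitation written for this note, not RegexpParser's actual code: it joins the tags into one string such as <DT><JJ><NNS> and rewrites each bracketed tag pattern into an ordinary regex. The helper names tag_pattern_to_regex and find_chunks are hypothetical.

```python
import re

def tag_pattern_to_regex(tag_pattern):
    """Rewrite a tag pattern such as '<DT>?<JJ>*<NN.*>+' into a regex that
    runs over a tag string such as '<DT><JJ><NN>'."""
    def convert(match):
        # Inside <...>, '.' may match any tag character but never a bracket,
        # so a pattern can not run across a token boundary.
        inner = match.group(1).replace(".", "[^<>]")
        return "(?:<(?:" + inner + ")>)"
    return re.compile(re.sub(r"<([^<>]+)>", convert, tag_pattern))

def find_chunks(tagged, tag_pattern):
    """Return the word lists of all non-overlapping chunks matching the pattern."""
    tag_string = "".join("<%s>" % tag for _, tag in tagged)
    chunks = []
    for m in tag_pattern_to_regex(tag_pattern).finditer(tag_string):
        if m.start() == m.end():
            continue  # ignore empty matches
        # Each '<' opens one token, so counting them converts a character
        # offset into a token index.
        start = tag_string.count("<", 0, m.start())
        end = tag_string.count("<", 0, m.end())
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dogs", "NNS"),
            ("barked", "VBD"), ("at", "IN"), ("Rex", "NNP")]

print(find_chunks(sentence, "<DT>?<JJ>*<NN.*>+"))
# [['the', 'little', 'yellow', 'dogs'], ['Rex']]
print(find_chunks(sentence, "<DT>?<JJ>*<NN>+"))
# []  (neither NNS nor NNP is exactly NN)
```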

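2.2 Handling Recursion

To capture recursive language structure, a rule may refer to chunk labels produced by other rules, including, indirectly, its own. In the code below, NP is defined first, and the VP and CLAUSE rules then refer to each other. The loop argument makes the parser run through the whole rule set more than once, so that chunks created late in one pass can be picked up by earlier rules in the next pass.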

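import nltk

grammar = r"""
NP: {<DT|JJ|NN.*>+}           # sequences of determiners, adjectives and nouns
PP: {<IN><NP>}                # a preposition followed by an NP
VP: {<VB.*><NP|PP|CLAUSE>+$}  # a verb followed by its arguments, up to the end
CLAUSE: {<NP><VP>}            # an NP followed by a VP
"""
# loop=2: run through the whole rule cascade twice
cp = nltk.RegexpParser(grammar, loop=2)
sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
            ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

print(cp.parse(sentence))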
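The effect of loop is easier to see on a sentence with one more level of embedding. The sketch below reuses the same grammar on a longer sentence (the sentence is an illustrative choice, not taken from the post) and compares a single pass with two passes by looking at how deep the resulting trees are. It assumes only that NLTK is installed; no corpus data is needed.

```python
import nltk

grammar = r"""
NP: {<DT|JJ|NN.*>+}
PP: {<IN><NP>}
VP: {<VB.*><NP|PP|CLAUSE>+$}
CLAUSE: {<NP><VP>}
"""

# "John thinks Mary saw the cat sit on the mat": a clause inside a clause
sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"), ("saw", "VBD"),
            ("the", "DT"), ("cat", "NN"), ("sit", "VB"), ("on", "IN"),
            ("the", "DT"), ("mat", "NN")]

single_pass = nltk.RegexpParser(grammar).parse(sentence)         # loop defaults to 1
two_passes = nltk.RegexpParser(grammar, loop=2).parse(sentence)  # cascade applied twice

print(single_pass)
print(two_passes)
print("tree height:", single_pass.height(), "vs", two_passes.height())
```

With a single pass the VP rule has already run by the time the innermost CLAUSE exists, so "saw" cannot take that clause as its argument; the second pass gives the VP rule another chance, which is why the second tree is nested more deeply.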

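3 Classifier-Based Chunking

This section trains a chunker on the conll2000 corpus from nltk.corpus. The corpus annotates chunks in IOB format. IOB stands for Inside, Outside, Begin and describes where a token sits relative to a chunk: B marks the first token of a chunk, I marks a token inside a chunk, and O marks a token outside any chunk. The figure below shows an example.
[Figure: chunk boundaries in IOB notation]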
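The IOB encoding is simple enough to write out by hand. The following is a small pure-Python sketch, with hypothetical helper names spans_to_iob and iob_to_spans, that converts chunk spans to IOB tags and back. It assumes well-formed input (spans do not overlap, and every chunk starts with a B- tag).

```python
def spans_to_iob(n_tokens, spans):
    """spans: list of (start, end, label), with end exclusive."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = "B-" + label            # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label            # remaining tokens of the chunk
    return tags

def iob_to_spans(tags):
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):    # the sentinel closes a trailing chunk
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, label))   # the current chunk ends before token i
            start = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans

tokens = ["We", "saw", "the", "yellow", "dog"]
tags = spans_to_iob(len(tokens), [(0, 1, "NP"), (2, 5, "NP")])
print(list(zip(tokens, tags)))
# [('We', 'B-NP'), ('saw', 'O'), ('the', 'B-NP'), ('yellow', 'I-NP'), ('dog', 'I-NP')]
print(iob_to_spans(tags))
# [(0, 1, 'NP'), (2, 5, 'NP')]
```

The B tag is what allows two chunks of the same type to sit next to each other without merging into one.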

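The corpus contains two files, train.txt and test.txt, and is annotated with three chunk types: NP, VP and PP. The table below describes the methods of the corpus reader:

Method	Description
tagged_sents(fileids)	Returns a list of POS-tagged sentences; each sentence is a list of (word, pos_tag) pairs
chunked_sents(fileids, chunk_types)	Returns a list of chunk trees, one per sentence; the leaves are (word, pos_tag) pairs and each chunk is a subtree labeled with its chunk type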

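The table below describes helper functions provided by the nltk.chunk package:

Method	Description
tree2conlltags(tree)	Converts a chunk tree into a list of (word, pos_tag, iob_tag) triples
conlltags2tree(sentence)	The inverse of the above: converts a list of triples back into a tree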
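A quick round trip shows what the two helpers do. This is a minimal sketch on a hand-written list of triples (the example sentence is made up for illustration); it needs NLTK installed but no corpus download.

```python
from nltk.chunk import conlltags2tree, tree2conlltags

# (word, pos_tag, iob_tag) triples for one sentence
triples = [("We", "PRP", "B-NP"), ("saw", "VBD", "O"),
           ("the", "DT", "B-NP"), ("yellow", "JJ", "I-NP"), ("dog", "NN", "I-NP")]

tree = conlltags2tree(triples)   # triples -> chunk tree
print(tree)

back = tree2conlltags(tree)      # chunk tree -> triples
print(back == triples)
```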

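The code below trains an IOB tag classifier with a maximum entropy classifier and then uses the predicted tags to build chunks. Each training example has the form ((word, pos_tag), iob_tag). Once trained, the classifier can assign an IOB tag to any new (word, pos_tag) pair it sees.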

import nltk
from nltk.corpus import conll2000

# Feature extractor: uses the POS tags of the current and the previous token
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "prevpos": prevpos}

# Tagger that predicts IOB tags with a classifier trained on the features above
class ContextNPChunkTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.MaxentClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
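        return list(zip(sentence, history))  # a real list, not a one-shot iterator (Python 3)

# Chunker that wraps the tagger so that it consumes and produces chunk trees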
class ContextNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ContextNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)

test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])

chunker = ContextNPChunker(train_sents)
print(chunker.evaluate(test_sents))

''' output
ChunkParse score:
    IOB Accuracy:  93.6%
    Precision:     82.0%
    Recall:        87.2%
    F-Measure:     84.6%
'''
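The feature function above looks only at the current and previous POS tags. A natural variation is to also give the classifier the current word and the following tag. The sketch below keeps the same (sentence, i, history) signature so it could be swapped in for npchunk_features; the name npchunk_features_wide is hypothetical, and the post does not report scores for this variant, so none are claimed here.

```python
def npchunk_features_wide(sentence, i, history):
    """sentence is a list of (word, pos) pairs; history is unused, as in the original."""
    word, pos = sentence[i]
    # Look one token to the left, with a marker at the sentence start
    prevpos = "<START>" if i == 0 else sentence[i - 1][1]
    # Look one token to the right, with a marker at the sentence end
    nextpos = "<END>" if i == len(sentence) - 1 else sentence[i + 1][1]
    return {"pos": pos, "word": word, "prevpos": prevpos, "nextpos": nextpos}

example = [("the", "DT"), ("cat", "NN"), ("sat", "VBD")]
print(npchunk_features_wide(example, 0, []))
# {'pos': 'DT', 'word': 'the', 'prevpos': '<START>', 'nextpos': 'NN'}
```

Richer features give the classifier more context for deciding whether a token begins, continues or lies outside a chunk, at the cost of a larger model and longer training.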

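4 Named Entity Recognition

The goal of a named entity recognition system is to identify the named entities (NE) mentioned in a text. The task splits into two subtasks: finding the boundaries of each NE and determining its type.
NER is also well suited to classifier-based methods. Annotated corpora typically label the following entity types: ['LOCATION', 'ORGANIZATION', 'PERSON', 'DURATION','DATE', 'CARDINAL', 'PERCENT', 'MONEY', 'MEASURE', 'FACILITY', 'GPE']
NLTK ships with a pre-trained NER classifier. The source file nltk/chunk/named_entity.py shows how a named entity chunker is trained on ace_data and is worth browsing.
The code below uses nltk.chunk.ne_chunk() to recognize named entities.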

import nltk

# Requires the Brown corpus and the NE chunker model, installed via nltk.download()
tagged = nltk.corpus.brown.tagged_sents()[0]
entity = nltk.chunk.ne_chunk(tagged)
print(entity)
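ne_chunk() returns a tree in which each recognized entity is a subtree labeled with its type, so the two subtasks (boundary and type) can be read straight off the tree. The sketch below uses a hypothetical helper, collect_entities, and a hand-built tree standing in for real ne_chunk() output, so it runs without any model download.

```python
from nltk import Tree

def collect_entities(tree):
    """Return (entity_type, entity_text) for every entity subtree under the root."""
    entities = []
    for child in tree:
        if isinstance(child, Tree):            # entity chunks are subtrees
            text = " ".join(word for word, tag in child.leaves())
            entities.append((child.label(), text))
    return entities                            # plain (word, tag) leaves are skipped

# Hand-built stand-in for the kind of tree ne_chunk() produces
example = Tree("S", [
    Tree("PERSON", [("Mary", "NNP")]),
    ("visited", "VBD"),
    Tree("GPE", [("New", "NNP"), ("York", "NNP")]),
])

print(collect_entities(example))
# [('PERSON', 'Mary'), ('GPE', 'New York')]
```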

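5 Relation Extraction

Once the named entities in a text have been identified, the relations between them can be extracted. This usually means looking for all triples of the form $(e_1, relation, e_2)$.

nltk/sem/relextract.py implements relation extraction for the ieer, ace and conll2002 corpora. The code below uses the regular expression r'.*\bpresident\b' to extract information about who is president of which organization (PER president ORG).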

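import re
import nltk

def open_ie():
    # The text between the two entities must contain the word "president"
    PR = re.compile(r'.*\bpresident\b')
    for doc in nltk.corpus.ieer.parsed_docs():
        for rel in nltk.sem.extract_rels('PER', 'ORG', doc, corpus='ieer', pattern=PR):
            # Print every match instead of returning after the first one
            print(nltk.sem.rtuple(rel))

open_ie()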
''' output (from a Python 2 run, hence the u'' prefixes)
[PER: u'Kiriyenko'] u'became president of the' [ORG: u'NORSI']
[PER: u'Bill Gross'] u', president of' [ORG: u'Idealab']
[PER: u'Abe Kleinfield'] u', a vice president at' [ORG: u'Open Text']
[PER: u'Kaufman'] u', president of the privately held' [ORG: u'TV Books LLC']
[PER: u'Lindsay Doran'] u', president of' [ORG: u'United Artists']
[PER: u'Laura Ziskin'] u', president of' [ORG: u'Fox 2000']
[PER: u'Tom Rothman'] u', president of production at' [ORG: u'20th Century Fox']
[PER: u'John Wren'] u', the president and chief executive at' [ORG: u'Omnicom']
[PER: u'Ken Kaess'] u', president of the' [ORG: u'DDB Needham']
[PER: u'Jack Ablin'] u', president of' [ORG: u'Barnett Capital Advisors Inc.']
[PER: u'Lloyd Kiva New'] u', president emeritus of the' [ORG: u'Institute of American Indian Art']
[PER: u'J. Jackson Walter'] u', who served as president of the' [ORG: u'National Trust for Historic Preservation']
[PER: u'Bill Gamba'] u', senior vice president and manager of bond trading at' [ORG: u'Cowen &AMP; Co.']
'''
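The core idea, a typed entity pair with a regular expression applied to the text between them, can be shown without the corpus. The sketch below is a simplified stand-in written for this note, not the logic of extract_rels: it assumes the text has already been turned into a flat list that mixes plain strings with (type, text) entity tuples. The function name extract_triples and the second sentence of the toy data are made up.

```python
import re

def extract_triples(chunks, subj_type, obj_type, pattern):
    """Find (subject, filler, object) where two entities of the requested
    types are separated by a string that matches the pattern."""
    triples = []
    for i in range(len(chunks) - 2):
        subj, filler, obj = chunks[i], chunks[i + 1], chunks[i + 2]
        if (isinstance(subj, tuple) and isinstance(filler, str)
                and isinstance(obj, tuple)
                and subj[0] == subj_type and obj[0] == obj_type
                and pattern.match(filler)):
            triples.append((subj[1], filler, obj[1]))
    return triples

PR = re.compile(r".*\bpresident\b")

# Toy input: entity tuples interleaved with the text between them
chunks = [("PER", "Bill Gross"), ", president of", ("ORG", "Idealab"),
          "hired", ("PER", "Ann Lee"), "as an engineer at", ("ORG", "Acme")]

print(extract_triples(chunks, "PER", "ORG", PR))
# [('Bill Gross', ', president of', 'Idealab')]
```

Only the first pair survives: the second PER/ORG pair has the right types, but its filler does not contain the word "president".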
