Natural Language Processing with Python, Chapter 7

1. Sentence segmenter, tokenizer, and part-of-speech tagger.

import nltk

def ie_preprocess(document):
    # split into sentences, tokenize each sentence, then POS-tag the tokens
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences
2. An example of an NP chunker based on a regular expression.

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
            ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
            ("the", "DT"), ("cat", "NN")]
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)
result.draw()
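The angle-bracket tag pattern that RegexpParser interprets can also be emulated with the plain `re` module, by encoding the POS-tag sequence as a string. A minimal sketch of that idea (no NLTK needed; the function name `find_np_chunks` is my own, not from the book):

```python
import re

def find_np_chunks(tagged_sent):
    """Return the word groups matching the tag pattern <DT>?<JJ>*<NN>."""
    # encode the tag sequence as a single string, one <TAG> per token
    encoded = ''.join('<%s>' % tag for _, tag in tagged_sent)
    chunks = []
    for m in re.finditer(r'(?:<DT>)?(?:<JJ>)*<NN>', encoded):
        # recover token indices by counting '<' before the match boundaries
        start = encoded[:m.start()].count('<')
        end = encoded[:m.end()].count('<')
        chunks.append([word for word, _ in tagged_sent[start:end]])
    return chunks

sent = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
        ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(find_np_chunks(sent))  # [['the', 'little', 'yellow', 'dog'], ['the', 'cat']]
```

On the example sentence this recovers the same two noun phrases that the NLTK chunker brackets.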

3. A simple noun phrase chunker.

grammar = r"""
NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
    {<NNP>+}                # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"), ("her", "PP$"),
            ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
print(cp.parse(sentence))

4. Extracting phrases that match a particular sequence of POS tags from a tagged corpus.

cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')
brown = nltk.corpus.brown
for sent in brown.tagged_sents():
    tree = cp.parse(sent)
    for subtree in tree.subtrees():
        if subtree.label() == 'CHUNK':
            print(subtree)

5. Simple evaluation and baselines. Establish a baseline with the trivial chunker cp, which creates no chunks at all.

from nltk.corpus import conll2000

cp = nltk.RegexpParser("")
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
print(cp.evaluate(test_sents))
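The baseline intuition can be checked without the corpus: in IOB terms, the empty-grammar chunker tags every token O, so its per-tag accuracy is simply the fraction of O tags in the gold data. A toy sketch with invented IOB data (the helper `iob_accuracy` is my own, not an NLTK function):

```python
def iob_accuracy(gold_tags, pred_tags):
    """Fraction of tokens whose predicted IOB chunk tag matches the gold tag."""
    correct = sum(1 for g, p in zip(gold_tags, pred_tags) if g == p)
    return correct / len(gold_tags)

# gold IOB tags for a toy sentence; the "create no chunks" baseline predicts all O
gold = ['B-NP', 'I-NP', 'O', 'B-NP', 'O', 'O']
baseline = ['O'] * len(gold)
print(iob_accuracy(gold, baseline))  # 0.5
```

Since roughly a third of conll2000 test tokens are outside any NP chunk, even this do-nothing baseline scores well above zero, which is why accuracy alone is a weak measure for chunkers.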

6. Noun phrase chunking with a unigram tagger.

class UnigramChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # convert each training tree to (POS tag, chunk tag) pairs
        # and train a unigram tagger on them
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.UnigramTagger(train_data)

    def parse(self, sentence):
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag)
                     for ((word, pos), chunktag) in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)
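The core idea of the unigram chunker — map each POS tag to the chunk tag it most often carries in the training data — can be sketched in plain Python without NLTK (a toy illustration with invented data, not the book's code):

```python
from collections import Counter, defaultdict

def train_unigram_chunk_model(train_data):
    """train_data: list of (pos_tag, chunk_tag) pairs.
    Returns a dict mapping each POS tag to its most frequent chunk tag."""
    counts = defaultdict(Counter)
    for pos, chunk in train_data:
        counts[pos][chunk] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

# toy training pairs, standing in for tree2conlltags output
train = [('DT', 'B-NP'), ('NN', 'I-NP'), ('VBD', 'O'),
         ('DT', 'B-NP'), ('JJ', 'I-NP'), ('NN', 'I-NP'), ('IN', 'O')]
model = train_unigram_chunk_model(train)
print([model[pos] for pos in ['DT', 'JJ', 'NN', 'VBD']])
# ['B-NP', 'I-NP', 'I-NP', 'O']
```

nltk.UnigramTagger does essentially this lookup, plus backoff handling for unseen tags.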

7. Creating trees.

>>> tree1 = nltk.Tree('NP', ['Alice'])
>>> tree2 = nltk.Tree('NP', ['the', 'rabbit'])
>>> tree3 = nltk.Tree('VP', ['chased', tree2])
>>> tree4 = nltk.Tree('S', [tree1, tree3])
>>> print(tree4)
(S (NP Alice) (VP chased (NP the rabbit)))
>>> print(tree4[0])
(NP Alice)
>>> print(tree4[0][0])
Alice
>>> print(tree4[0][0][0])
A
>>> print(tree4[1])
(VP chased (NP the rabbit))
>>> tree4[1].label()
'VP'
>>> tree4.leaves()
['Alice', 'chased', 'the', 'rabbit']
>>> tree4[1][1][1]
'rabbit'

8. A recursive function to traverse a tree.

def traverse(t):
    try:
        t.label
    except AttributeError:
        # t is a leaf (a plain string)
        print(t, end=' ')
    else:
        # Now we know that t is a Tree
        print('(', t.label(), end=' ')
        for child in t:
            traverse(child)
        print(')', end=' ')

>>> t = nltk.Tree.fromstring('(S (NP Alice) (VP chased (NP the rabbit)))')
>>> traverse(t)
( S ( NP Alice ) ( VP chased ( NP the rabbit ) ) )

9. NLTK provides a pre-trained classifier that can recognize named entities, accessed with the function nltk.ne_chunk(). If we set binary=True, named entities are just tagged NE; otherwise the classifier adds type labels such as PERSON, ORGANIZATION, and GPE.

sent = nltk.corpus.treebank.tagged_sents()[22]
print(nltk.ne_chunk(sent, binary=True))
print(nltk.ne_chunk(sent))
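The chunked output is commonly flattened to IOB tags, and grouping consecutive B-/I- tokens recovers the entity spans. A minimal sketch over hand-written IOB pairs (example data invented for illustration; `extract_entities` is my own helper, not an NLTK function):

```python
def extract_entities(iob_pairs):
    """Group (word, iob_tag) pairs into (entity_text, entity_type) tuples."""
    entities, current_words, current_type = [], [], None
    for word, tag in iob_pairs:
        if tag.startswith('B-'):
            # a new entity starts; close any entity still open
            if current_words:
                entities.append((' '.join(current_words), current_type))
            current_words, current_type = [word], tag[2:]
        elif tag.startswith('I-') and current_words:
            current_words.append(word)
        else:
            # 'O' (or a stray I-) ends any open entity
            if current_words:
                entities.append((' '.join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((' '.join(current_words), current_type))
    return entities

pairs = [('Pierre', 'B-PERSON'), ('Vinken', 'I-PERSON'), ('joined', 'O'),
         ('Elsevier', 'B-ORGANIZATION'), ('in', 'O'), ('London', 'B-GPE')]
print(extract_entities(pairs))
# [('Pierre Vinken', 'PERSON'), ('Elsevier', 'ORGANIZATION'), ('London', 'GPE')]
```

nltk.chunk.tree2conlltags can produce such IOB triples directly from an ne_chunk tree.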
