Reading Notes on Mastering Natural Language Processing with Python (Deepti Chopra), Chapter 4: Parts-of-Speech Tagging

Mastering Natural Language Processing with Python

Deepti Chopra (India)
Translated by Wang Wei


Chapter 4 Parts-of-Speech Tagging: Identifying Words

Parts-of-speech (POS) tagging is defined as the process of assigning a specific POS tag to each word in a sentence.


4.1 Introduction to Parts-of-Speech Tagging

An example of POS tagging (POS taggers live in the nltk.tag package and inherit from the TaggerI base class):
import nltk
text1=nltk.word_tokenize("It is a pleasant day today")
print(nltk.pos_tag(text1))
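
Taggers implement the TaggerI interface mentioned above. As a minimal sketch (the class name and its always-NN behaviour are illustrative, not from the book), a custom tagger only needs to provide a tag() method:
from nltk.tag.api import TaggerI

class EveryWordNounTagger(TaggerI):
	# Toy tagger: assigns the tag 'NN' to every token.
	def tag(self, tokens):
		return [(token, 'NN') for token in tokens]

print(EveryWordNounTagger().tag(['a', 'pleasant', 'day']))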
Penn Treebank provides a list of the available tags:

http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

Information about the NNS tag:
nltk.help.upenn_tagset('NNS')
Querying tags with a regular expression:
nltk.help.upenn_tagset('VB.*')
Performing word sense disambiguation through POS tagging:
import nltk
text=nltk.word_tokenize("I cannot bear the pain of bear")
print(nltk.pos_tag(text))
Using str2tuple() to create a tuple from a word/tag string:
import nltk
taggedword=nltk.tag.str2tuple('bear/NN')
print(taggedword)
print(taggedword[0])
print(taggedword[1])
Generating a sequence of tuples from the given tagged text:
import nltk
sentence = '''The/DT sacred/VBN Ganga/NNP flows/VBZ in/IN this/DT region/NN ./. This/DT is/VBZ a/DT pilgrimage/NN ./. People/NNP from/IN all/DT over/IN the/DT country/NN visit/NN this/DT place/NN ./. '''
print([nltk.tag.str2tuple(t) for t in sentence.split()])
Converting a (word, tag) tuple back into a word/tag string:
import nltk
taggedtok = ('bear', 'NN')
from nltk.tag.util import tuple2str
print(tuple2str(taggedtok))
Frequencies of the most common tags in the Treebank corpus:
import nltk
from nltk.corpus import treebank
treebank_tagged = treebank.tagged_words(tagset='universal')
tag = nltk.FreqDist(tag for (word, tag) in treebank_tagged)
print(tag.most_common())
Tags that occur before a noun tag, ordered by frequency:
import nltk
from nltk.corpus import treebank	
treebank_tagged = treebank.tagged_words(tagset='universal')
tagpairs = nltk.bigrams(treebank_tagged)
preceders_noun = [x[1] for (x, y) in tagpairs if y[1] == 'NOUN']
freqdist = nltk.FreqDist(preceders_noun)
print([tag for (tag, _) in freqdist.most_common()])
Building a dictionary of word/tag pairs:
import nltk
tag={}
print(tag)
tag['beautiful']='ADJ'
print(tag)
tag['boy']='N'
tag['read']='V'
tag['generously']='ADV'
print(tag)
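
As a small follow-up (illustrative, not from the book), the dictionary built above can act as a simple lookup tagger; words it does not contain come back as None:
sentence = ['boy', 'read', 'quickly']
print([(word, tag.get(word)) for word in sentence])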
4.1.1 Default Tagging

Default tagging assigns the same POS tag to every token.

How the DefaultTagger class works:
import nltk
from nltk.tag import DefaultTagger
tag = DefaultTagger('NN')
print(tag.tag(['Beautiful', 'morning']))

The nltk.tag.untag() function converts a tagged sentence into an untagged one.

Untagging a sentence:
import nltk
from nltk.tag import untag
print(untag([('beautiful', 'NN'), ('morning', 'NN')]))

4.2 Creating a POS-Tagged Corpus

Creating a data directory:
import nltk
import os,os.path
create = os.path.expanduser('~/nltkdoc')
if not os.path.exists(create):
	os.mkdir(create)
print(os.path.exists(create))
import nltk.data
print(create in nltk.data.path)
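
For the nltk.data.load() call in the next snippet to succeed, the file must exist somewhere on nltk.data.path. A hedged sketch that creates it under ~/nltk_data (the directory layout and file contents are illustrative assumptions, not from the book):
import os, os.path
subdir = os.path.expanduser('~/nltk_data/nltkcorpora/important')
if not os.path.exists(subdir):
	os.makedirs(subdir)
with open(os.path.join(subdir, 'firstdoc.txt'), 'w') as f:
	f.write('This is the first document.')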
Loading a text file from a subdirectory:
import nltk.data
print(nltk.data.load('nltkcorpora/important/firstdoc.txt',format='raw'))
Counting the words in the name files:
import nltk
from nltk.corpus import names
print(names.fileids())
print(len(names.words('male.txt')))
print(len(names.words('female.txt')))
Counting the words contained in the English wordlist files:
import nltk
from nltk.corpus import words
print(words.fileids())
print(len(words.words('en')))
print(len(words.words('en-basic')))
Definition of the Maxent Treebank POS tagger, as it appears in the NLTK 2.x source (load comes from nltk.data, and _POS_TAGGER names the pickled tagger model):
def pos_tag(tokens):
	"""
	Use NLTK's currently recommended part-of-speech tagger to tag
	the given list of tokens.

		>>> from nltk.tag import pos_tag
		>>> from nltk.tokenize import word_tokenize
		>>> pos_tag(word_tokenize("Papa's favourite hobby is reading."))

	:param tokens: list of tokens that need to be tagged
	:type tokens: list(str)
	:return: The tagged tokens
	:rtype: list(tuple(str, str))
	"""
	tagger = load(_POS_TAGGER)
	return tagger.tag(tokens)

def batch_pos_tag(sentences):
	# Use NLTK's part-of-speech tagger to tag a list of sentences,
	# each of which is a list of tokens.
	tagger = load(_POS_TAGGER)
	return tagger.batch_tag(sentences)

4.3 Selecting a Machine Learning Algorithm

POS tagging is also referred to as word sense disambiguation or grammatical tagging.
POS tagging algorithms fall into two types:

  • rule-based
  • stochastic/probabilistic

Second-order classifiers:
A POS classifier takes a document as input and extracts features for the words. It trains itself using these word features combined with the already-available training labels.
The trigram POS tagger relies on the bigram tagger, which in turn relies on the unigram tagger; this backoff chain is sketched below.
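
A minimal sketch of that chain (the treebank slices chosen here are illustrative, not from the book):
import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger

train = treebank.tagged_sents()[:3000]
test = treebank.tagged_sents()[3000:]
uni = UnigramTagger(train)
bi = BigramTagger(train, backoff=uni)    # falls back to the unigram tagger
tri = TrigramTagger(train, backoff=bi)   # falls back to the bigram tagger
print(tri.evaluate(test))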

Using the FastBrillTagger, seeded with a unigram tagger's dictionary of POS tag information (this is the NLTK 2.x Brill API; sentences is the tagged training data):
from nltk.tag import UnigramTagger
from nltk.tag import FastBrillTaggerTrainer
from nltk.tag.brill import SymmetricProximateTokensTemplate
from nltk.tag.brill import ProximateTokensTemplate
from nltk.tag.brill import ProximateTagsRule
from nltk.tag.brill import ProximateWordsRule

ctx = [  # Context = surrounding words and tags.
	SymmetricProximateTokensTemplate(ProximateTagsRule, (1, 1)),
	SymmetricProximateTokensTemplate(ProximateTagsRule, (1, 2)),
	SymmetricProximateTokensTemplate(ProximateTagsRule, (1, 3)),
	SymmetricProximateTokensTemplate(ProximateTagsRule, (2, 2)),
	SymmetricProximateTokensTemplate(ProximateWordsRule, (0, 0)),
	SymmetricProximateTokensTemplate(ProximateWordsRule, (1, 1)),
	SymmetricProximateTokensTemplate(ProximateWordsRule, (1, 2)),
	ProximateTokensTemplate(ProximateTagsRule, (-1, -1), (1, 1)),
]
tagger = UnigramTagger(sentences)
tagger = FastBrillTaggerTrainer(tagger, ctx, trace=0)
tagger = tagger.train(sentences, max_rules=100)
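
The classes used above belong to the NLTK 2.x Brill module, which was rewritten in NLTK 3. A rough NLTK 3 equivalent looks like this (a sketch, again assuming sentences is the tagged training data; fntbl37() is one of the template sets bundled with NLTK 3):
from nltk.tag import UnigramTagger
from nltk.tag.brill import fntbl37
from nltk.tag.brill_trainer import BrillTaggerTrainer

initial = UnigramTagger(sentences)
trainer = BrillTaggerTrainer(initial, fntbl37(), trace=0)
brill_tagger = trainer.train(sentences, max_rules=100)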

In supervised classification, a training corpus containing words and their correct tags is used. In unsupervised classification, no such pairing of words with a list of correct tags exists.


4.4 Statistical Modeling Involving n-grams

Training a UnigramTagger:
import nltk
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
training= treebank.tagged_sents()[:7000]
unitagger=UnigramTagger(training)
print(treebank.sents()[0])
print(unitagger.tag(treebank.sents()[0]))
Computing the accuracy of UnigramTagger (note that NLTK's treebank sample contains fewer than 7000 sentences, so these training and testing slices overlap):
import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger
training= treebank.tagged_sents()[:7000]
unitagger=UnigramTagger(training)
testing = treebank.tagged_sents()[2000:]
print(unitagger.evaluate(testing))
POS tagging with a UnigramTagger supplied with an explicit model:
import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger
unitag = UnigramTagger(model={'Vinken': 'NN'})
print(unitag.tag(treebank.sents()[0]))
Evaluating a UnigramTagger trained with a cutoff (contexts with too few occurrences in the training data are discarded):
unitagger = UnigramTagger(training,cutoff=5)
print(unitagger.evaluate(testing))
DefaultTagger and UnigramTagger can be chained to tag tokens: if one tagger cannot tag a word, the next (backoff) tagger is used:
import nltk
from nltk.tag import UnigramTagger
from nltk.tag import DefaultTagger
from nltk.corpus import treebank
testing = treebank.tagged_sents()[2000:]
training= treebank.tagged_sents()[:7000]
tag1=DefaultTagger('NN')
tag2=UnigramTagger(training,backoff=tag1)
print(tag2.evaluate(testing))
Implementation of BigramTagger:
import nltk
from nltk.tag import BigramTagger
from nltk.corpus import treebank
training_1= treebank.tagged_sents()[:7000]
bigramtagger=BigramTagger(training_1)
print(treebank.sents()[0])
print(bigramtagger.tag(treebank.sents()[0]))
testing_1 = treebank.tagged_sents()[2000:]
print(bigramtagger.evaluate(testing_1))
Implementation of BigramTagger and TrigramTagger:
import nltk
from nltk.tag import BigramTagger, TrigramTagger
from nltk.corpus import treebank
testing = treebank.tagged_sents()[2000:]
training= treebank.tagged_sents()[:7000]
bigramtag = BigramTagger(training)
print(bigramtag.evaluate(testing))
trigramtag = TrigramTagger(training)
print(trigramtag.evaluate(testing))
Developing a quadgram tagger with NgramTagger:
import nltk
from nltk.corpus import treebank
from nltk import NgramTagger
testing = treebank.tagged_sents()[2000:]
training= treebank.tagged_sents()[:7000]
quadgramtag = NgramTagger(4, training)
print(quadgramtag.evaluate(testing))
The AffixTagger, which tags words based on their prefixes or suffixes (by default it learns three-character suffixes):
import nltk
from nltk.tag import AffixTagger
from nltk.corpus import treebank
testing = treebank.tagged_sents()[2000:]
training= treebank.tagged_sents()[:7000]
affixtag = AffixTagger(training)
print(affixtag.evaluate(testing))
Learning and using four-character prefixes with AffixTagger:
import nltk
from nltk.tag import AffixTagger
from nltk.corpus import treebank
testing = treebank.tagged_sents()[2000:]
training= treebank.tagged_sents()[:7000]
prefixtag = AffixTagger(training, affix_length=4)
print(prefixtag.evaluate(testing))
Learning and using three-character suffixes with AffixTagger (a negative affix_length selects suffixes):
import nltk
from nltk.tag import AffixTagger
from nltk.corpus import treebank
testing = treebank.tagged_sents()[2000:]
training= treebank.tagged_sents()[:7000]
suffixtag = AffixTagger(training, affix_length=-3)
print(suffixtag.evaluate(testing))
Combining several affix taggers in a backoff chain:
import nltk
from nltk.tag import AffixTagger
from nltk.corpus import treebank
testing = treebank.tagged_sents()[2000:]
training= treebank.tagged_sents()[:7000]
prefixtagger=AffixTagger(training,affix_length=4)
prefixtagger3=AffixTagger(training,affix_length=3,backoff=prefixtagger)
print(prefixtagger3.evaluate(testing))
suffixtagger3=AffixTagger(training,affix_length=-3,backoff=prefixtagger3)
print(suffixtagger3.evaluate(testing))
suffixtagger4=AffixTagger(training,affix_length=-4,backoff=suffixtagger3)
print(suffixtagger4.evaluate(testing))
TnT (Trigrams'n'Tags) is a statistics-based tagger built on a second-order Markov model:
import nltk
from nltk.tag import tnt
from nltk.corpus import treebank
testing = treebank.tagged_sents()[2000:]
training= treebank.tagged_sents()[:7000]
tnt_tagger=tnt.TnT()
tnt_tagger.train(training)
print(tnt_tagger.evaluate(testing))
Handling unknown words in TnT with a DefaultTagger:
import nltk
from nltk.tag import DefaultTagger
from nltk.tag import tnt
from nltk.corpus import treebank
testing = treebank.tagged_sents()[2000:]
training= treebank.tagged_sents()[:7000]
unknown = DefaultTagger('NN')
tnt_tagger = tnt.TnT(unk=unknown, Trained=True)
tnt_tagger.train(training)
print(tnt_tagger.evaluate(testing))

4.5 Developing a Chunker Using a POS-Tagged Corpus

Chunking is a process that can be used to perform entity recognition; it segments and labels multi-token sequences within a sentence.

Performing noun phrase chunking by constructing a chunk rule:
import nltk
sent=[("A","DT"),("wise", "JJ"), ("small", "JJ"),("girl", "NN"), ("of", "IN"), ("village", "N"),  ("became", "VBD"), ("leader", "NN")]
grammar = "NP: {<DT>?<JJ>*<NN><IN>?<NN>*}"
find = nltk.RegexpParser(grammar)
res = find.parse(sent)
print(res)
res.draw()   
A noun phrase chunk rule that accepts any number of consecutive nouns:
import nltk
noun1=[("financial","NN"),("year","NN"),("account","NN"),("summary","NN")]
gram="NP:{<NN>+}"
find = nltk.RegexpParser(gram)
print(find.parse(noun1))
x=find.parse(noun1)
x.draw()
A UnigramChunker used to perform chunking and parsing:
import nltk

class UnigramChunker(nltk.ChunkParserI):
	def __init__(self, training):
		# Train a unigram tagger on (POS tag, IOB chunk tag) pairs
		# extracted from the training chunk trees.
		training_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
			for sent in training]
		self.tagger = nltk.UnigramTagger(training_data)

	def parse(self, sent):
		# sent is a list of (word, POS tag) pairs.
		postags = [pos1 for (word1, pos1) in sent]
		tagged_postags = self.tagger.tag(postags)
		chunktags = [chunktag for (pos1, chunktag) in tagged_postags]
		conlltags = [(word, pos1, chunktag)
			for ((word, pos1), chunktag) in zip(sent, chunktags)]
		return nltk.chunk.conlltags2tree(conlltags)
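
A minimal usage sketch for this chunker (assuming the conll2000 corpus is installed; restricting to NP chunks is an illustrative choice):
from nltk.corpus import conll2000
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
chunker = UnigramChunker(train_sents)
print(chunker.evaluate(test_sents))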
Evaluating the accuracy of chunkers after training:
import nltk.corpus, nltk.tag

def conll_tag_chunks(chunk_sents):
	# Helper assumed by the function below (not shown in the original
	# excerpt): convert chunk trees into (POS tag, IOB chunk tag) sequences.
	tag_sents = [nltk.chunk.tree2conlltags(tree) for tree in chunk_sents]
	return [[(t, c) for (w, t, c) in sent] for sent in tag_sents]

def ubt_conll_chunk_accuracy(train_sents, test_sents):
	chunks_train = conll_tag_chunks(train_sents)
	chunks_test = conll_tag_chunks(test_sents)

	# nltk.tag.accuracy is the NLTK 2.x helper; in NLTK 3 use tagger.evaluate().
	chunker1 = nltk.tag.UnigramTagger(chunks_train)
	print('u:', nltk.tag.accuracy(chunker1, chunks_test))

	chunker2 = nltk.tag.BigramTagger(chunks_train, backoff=chunker1)
	print('ub:', nltk.tag.accuracy(chunker2, chunks_test))

	chunker3 = nltk.tag.TrigramTagger(chunks_train, backoff=chunker2)
	print('ubt:', nltk.tag.accuracy(chunker3, chunks_test))

	chunker4 = nltk.tag.TrigramTagger(chunks_train, backoff=chunker1)
	print('ut:', nltk.tag.accuracy(chunker4, chunks_test))

	chunker5 = nltk.tag.BigramTagger(chunks_train, backoff=chunker4)
	print('utb:', nltk.tag.accuracy(chunker5, chunks_test))

# accuracy test for conll chunking
conll_train = nltk.corpus.conll2000.chunked_sents('train.txt')
conll_test = nltk.corpus.conll2000.chunked_sents('test.txt')
ubt_conll_chunk_accuracy(conll_train, conll_test)

# accuracy test for treebank chunking
treebank_sents = nltk.corpus.treebank_chunk.chunked_sents()
ubt_conll_chunk_accuracy(treebank_sents[:2000], treebank_sents[2000:])

“”"***笔者的话:整理了《精通Python自然语言处理》的第四章内容:词性标注。个人感觉,词性标注在中文分词中的作用还是比较重要的。如何能正确判别词在句中的词性对之后的中文摘要、关键词提取都是至关重要的。本博客记录了书中的每段代码。希望对阅读这本书的人有所帮助。FIGHTING...(热烈欢迎大家批评指正,互相讨论)
(A choice each must make for themselves, something no hero will ever defeat.)
***"""
