声明:代码的运行环境为Python3。Python3与Python2在一些细节上会有所不同,希望广大读者注意。本博客以代码为主,代码中会有详细的注释。相关文章将会发布在我的个人博客专栏《Python自然语言处理》,欢迎大家关注。
词性标注在自然语言处理中也是很重要的一环,本篇文章将主要介绍一下相关的词性标注器,一起来看看吧~
一、词性的初测
分词不同的内容,可能用于进行词性标注标识是不同的,例如:
import nltk
from nltk.tag import pos_tag # 词性标注器
from nltk.tokenize import word_tokenize
import pickle
# 词性标注器
text = word_tokenize("And now for something completely different")
print(pos_tag(text))
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
print(pos_tag(text))
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
此时可以通过设置tagset='universal'使得词性标注统一:
text = word_tokenize("And now for something completely different")
print(pos_tag(text, tagset='universal'))
[('And', 'CONJ'), ('now', 'ADV'), ('for', 'ADP'), ('something', 'NOUN'), ('completely', 'ADV'), ('different', 'ADJ')]
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
print(pos_tag(text, tagset='universal'))
[('They', 'PRON'), ('refuse', 'VERB'), ('to', 'PRT'), ('permit', 'VERB'), ('us', 'PRON'), ('to', 'PRT'), ('obtain', 'VERB'), ('the', 'DET'), ('refuse', 'NOUN'), ('permit', 'NOUN')]
统一的词性标注说明如下:
可以根据单词的词性进行相似词性的单词搜索,例如:
text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar('woman')
man time day year car moment world house family child country boy
state job place way war girl work word
也可以对某个语料库进行标注:
# 标注语料库
tagged_token = nltk.tag.str2tuple('fly/NN') # 将字符串转化成元组
print(tagged_token)
('fly', 'NN')
示例:1.查找布朗语料库中最常见的标注信息
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal') # 查找布朗语料库最常见的标注信息
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()
tag_fd.plot()
Backend Qt5Agg is interactive backend. Turning interactive mode on.
2.查找布朗语料库中与名词挨着的词的词性频率
word_tag_pairs = list(nltk.bigrams(brown_news_tagged))
nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN').most_common()
Out[13]:
[('NOUN', 7959),
('DET', 7373),
('ADJ', 4761),
('ADP', 3781),
('.', 2796),
('VERB', 1842),
('CONJ', 938),
('NUM', 894),
('ADV', 186),
('PRT', 94),
('PRON', 19),
('X', 11)]
通过上面的例子可以发现,在布朗语料库中,与名词挨着的出现频率最大的还是名词。
二、自动标注
1、默认标注器
from nltk.corpus import brown
raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
tokens = nltk.word_tokenize(raw)
default_tagger = nltk.DefaultTagger('NN') # 默认的标注器标注为nn
default_tagger.tag(tokens) # 所有的词性全部变为名词
Out[15]:
[('I', 'NN'),
('do', 'NN'),
('not', 'NN'),
('like', 'NN'),
('green', 'NN'),
('eggs', 'NN'),
('and', 'NN'),
('ham', 'NN'),
(',', 'NN'),
('I', 'NN'),
('do', 'NN'),
('not', 'NN'),
('like', 'NN'),
('them', 'NN'),
('Sam', 'NN'),
('I', 'NN'),
('am', 'NN'),
('!', 'NN')]
default_tagger.evaluate(brown.tagged_sents(categories='news')) # 计算标注的词语符合要求的占比
Out[16]: 0.13089484257215028
2、正则表达式标注器
# 正则表达式标注器
patterns = [ # 自定义模式
(r'.*ing$', 'VBG'), # gerunds
(r'.*ed$', 'VBD'), # simple past
(r'.*es$', 'VBZ'), # 3rd singular present
(r'.*ould$', 'MD'), # modals
(r'.*\'s$', 'NN$'), # possessive nouns
(r'.*s$', 'NNS'), # plural nouns
(r'^-?[0-9]+(.[0-9]+)?$', 'CD'), # cardinal numbers
(r'.*', 'NN') # nouns (default)
]
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(brown.sents()[3])
Out[17]:
[('``', 'NN'),
('Only', 'NN'),
('a', 'NN'),
('relative', 'NN'),
('handful', 'NN'),
('of', 'NN'),
('such', 'NN'),
('reports', 'NNS'),
('was', 'NNS'),
('received', 'VBD'),
("''", 'NN'),
(',', 'NN'),
('the', 'NN'),
('jury', 'NN'),
('said', 'NN'),
(',', 'NN'),
('``', 'NN'),
('considering', 'VBG'),
('the', 'NN'),
('widespread', 'NN'),
('interest', 'NN'),
('in', 'NN'),
('the', 'NN'),
('election', 'NN'),
(',', 'NN'),
('the', 'NN'),
('number', 'NN'),
('of', 'NN'),
('voters', 'NNS'),
('and', 'NN'),
('the', 'NN'),
('size', 'NN'),
('of', 'NN'),
('this', 'NNS'),
('city', 'NN'),
("''", 'NN'),
('.', 'NN')]
regexp_tagger.evaluate(brown.tagged_sents(categories='news')) # 计算符合要求的占比
Out[18]: 0.20326391789486245
3、查询标注器
# 查询标注器
fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
most_freq_words = fd.most_common()[:100]
likely_tags = dict((word, cfd[word].max()) for (word, freq) in most_freq_words)
baseline_tagger = nltk.UnigramTagger(model=likely_tags)
baseline_tagger.evaluate(brown.tagged_sents(categories='news'))
Out[22]: 0.45578495136941344
三、N-gram标注
1、一元模型
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
print(unigram_tagger.tag(brown_sents[2007]))
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
print(unigram_tagger.evaluate(brown_tagged_sents))
0.9349006503968017
2、分离训练与测试数据
# 分离训练与测试数据
size = int(len(brown_tagged_sents) * 0.9)
print(size)
4160
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)
Out[29]: 0.8121200039868434
3、一般N-gram标注
# 一般N-gram标注
bigram_tagger = nltk.BigramTagger(train_sents)
bigram_tagger.tag(brown_sents[2007])
Out[30]:
[('Various', 'JJ'),
('of', 'IN'),
('the', 'AT'),
('apartments', 'NNS'),
('are', 'BER'),
('of', 'IN'),
('the', 'AT'),
('terrace', 'NN'),
('type', 'NN'),
(',', ','),
('being', 'BEG'),
('on', 'IN'),
('the', 'AT'),
('ground', 'NN'),
('floor', 'NN'),
('so', 'CS'),
('that', 'CS'),
('entrance', 'NN'),
('is', 'BEZ'),
('direct', 'JJ'),
('.', '.')]
4、组合标注器
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t2.evaluate(test_sents)
Out[31]: 0.8452108043456593
5、存储标注器
# 储存标注器
from pickle import dump
output = open('t2.pkl', 'wb')
dump(t2, output, -1)
output.close()
from pickle import load
input = open('t2.pkl', 'rb')
tagger = load(input)
input.close()
text = """The board's action shows what free enterprise
is up against in our complex maze of regulatory laws ."""
tokens = text.split()
tagger.tag(tokens)
Out[34]:
[('The', 'AT'),
("board's", 'NN$'),
('action', 'NN'),
('shows', 'NNS'),
('what', 'WDT'),
('free', 'JJ'),
('enterprise', 'NN'),
('is', 'BEZ'),
('up', 'RP'),
('against', 'IN'),
('in', 'IN'),
('our', 'PP$'),
('complex', 'JJ'),
('maze', 'NN'),
('of', 'IN'),
('regulatory', 'NN'),
('laws', 'NNS'),
('.', '.')]
6、跨句子边界标注器
# 跨句子边界标注
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t2.evaluate(test_sents)
Out[36]: 0.8452108043456593