词性标注器

最新推荐文章于 2024-03-17 09:53:55 发布

象在舞

最新推荐文章于 2024-03-17 09:53:55 发布

阅读量1.1k

点赞数 5

分类专栏： Python 自然语言处理 Python自然语言处理文章标签： nlp 词性标注器

本文链接：https://blog.csdn.net/gdkyxy2013/article/details/87777387

版权

Python 同时被 3 个专栏收录

55 篇文章 6 订阅

订阅专栏

自然语言处理

14 篇文章 3 订阅

订阅专栏

Python自然语言处理

12 篇文章 21 订阅

订阅专栏

声明：代码的运行环境为Python3。Python3与Python2在一些细节上会有所不同，希望广大读者注意。本博客以代码为主，代码中会有详细的注释。相关文章将会发布在我的个人博客专栏《Python自然语言处理》，欢迎大家关注。

词性标注在自然语言处理中也是很重要的一环，本篇文章将主要介绍一下相关的词性标注器，一起来看看吧~

一、词性的初测

分词不同的内容，可能用于进行词性标注标识是不同的，例如：

import nltk
from nltk.tag import pos_tag  # 词性标注器
from nltk.tokenize import word_tokenize
import pickle
# 词性标注器
text = word_tokenize("And now for something completely different")
print(pos_tag(text))

[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

text = word_tokenize("They refuse to permit us to obtain the refuse permit")
print(pos_tag(text))

[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]

此时可以通过设置tagset='universal'使得词性标注统一：

text = word_tokenize("And now for something completely different")
print(pos_tag(text, tagset='universal'))

[('And', 'CONJ'), ('now', 'ADV'), ('for', 'ADP'), ('something', 'NOUN'), ('completely', 'ADV'), ('different', 'ADJ')]

text = word_tokenize("They refuse to permit us to obtain the refuse permit")
print(pos_tag(text, tagset='universal'))

[('They', 'PRON'), ('refuse', 'VERB'), ('to', 'PRT'), ('permit', 'VERB'), ('us', 'PRON'), ('to', 'PRT'), ('obtain', 'VERB'), ('the', 'DET'), ('refuse', 'NOUN'), ('permit', 'NOUN')]

统一的词性标注说明如下：

可以根据单词的词性进行相似词性的单词搜索，例如：

text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
text.similar('woman')

man time day year car moment world house family child country boy
state job place way war girl work word

也可以对某个语料库进行标注：

# 标注语料库
tagged_token = nltk.tag.str2tuple('fly/NN')  # 将字符串转化成元组
print(tagged_token)

('fly', 'NN')

示例：1.查找布朗语料库中最常见的标注信息

from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')  # 查找布朗语料库最常见的标注信息
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
tag_fd.most_common()
tag_fd.plot()

Backend Qt5Agg is interactive backend. Turning interactive mode on.

2.查找布朗语料库中与名词挨着的词的词性频率

word_tag_pairs = list(nltk.bigrams(brown_news_tagged))
nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN').most_common()

Out[13]: 
[('NOUN', 7959),
 ('DET', 7373),
 ('ADJ', 4761),
 ('ADP', 3781),
 ('.', 2796),
 ('VERB', 1842),
 ('CONJ', 938),
 ('NUM', 894),
 ('ADV', 186),
 ('PRT', 94),
 ('PRON', 19),
 ('X', 11)]

通过上面的例子可以发现，在布朗语料库中，与名词挨着的出现频率最大的还是名词。

二、自动标注

1、默认标注器

from nltk.corpus import brown
raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
tokens = nltk.word_tokenize(raw)
default_tagger = nltk.DefaultTagger('NN')  # 默认的标注器标注为nn
default_tagger.tag(tokens)  # 所有的词性全部变为名词

Out[15]: 
[('I', 'NN'),
 ('do', 'NN'),
 ('not', 'NN'),
 ('like', 'NN'),
 ('green', 'NN'),
 ('eggs', 'NN'),
 ('and', 'NN'),
 ('ham', 'NN'),
 (',', 'NN'),
 ('I', 'NN'),
 ('do', 'NN'),
 ('not', 'NN'),
 ('like', 'NN'),
 ('them', 'NN'),
 ('Sam', 'NN'),
 ('I', 'NN'),
 ('am', 'NN'),
 ('!', 'NN')]

default_tagger.evaluate(brown.tagged_sents(categories='news'))  # 计算标注的词语符合要求的占比

Out[16]: 0.13089484257215028

2、正则表达式标注器

# 正则表达式标注器
patterns = [  # 自定义模式
    (r'.*ing$', 'VBG'),  # gerunds
    (r'.*ed$', 'VBD'),  # simple past
    (r'.*es$', 'VBZ'),  # 3rd singular present
    (r'.*ould$', 'MD'),  # modals
    (r'.*\'s$', 'NN$'),  # possessive nouns
    (r'.*s$', 'NNS'),  # plural nouns
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN')  # nouns (default)
]
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(brown.sents()[3])

Out[17]: 
[('``', 'NN'),
 ('Only', 'NN'),
 ('a', 'NN'),
 ('relative', 'NN'),
 ('handful', 'NN'),
 ('of', 'NN'),
 ('such', 'NN'),
 ('reports', 'NNS'),
 ('was', 'NNS'),
 ('received', 'VBD'),
 ("''", 'NN'),
 (',', 'NN'),
 ('the', 'NN'),
 ('jury', 'NN'),
 ('said', 'NN'),
 (',', 'NN'),
 ('``', 'NN'),
 ('considering', 'VBG'),
 ('the', 'NN'),
 ('widespread', 'NN'),
 ('interest', 'NN'),
 ('in', 'NN'),
 ('the', 'NN'),
 ('election', 'NN'),
 (',', 'NN'),
 ('the', 'NN'),
 ('number', 'NN'),
 ('of', 'NN'),
 ('voters', 'NNS'),
 ('and', 'NN'),
 ('the', 'NN'),
 ('size', 'NN'),
 ('of', 'NN'),
 ('this', 'NNS'),
 ('city', 'NN'),
 ("''", 'NN'),
 ('.', 'NN')]

regexp_tagger.evaluate(brown.tagged_sents(categories='news'))  # 计算符合要求的占比

Out[18]: 0.20326391789486245

3、查询标注器

# 查询标注器
fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
most_freq_words = fd.most_common()[:100]
likely_tags = dict((word, cfd[word].max()) for (word, freq) in most_freq_words)
baseline_tagger = nltk.UnigramTagger(model=likely_tags)
baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

Out[22]: 0.45578495136941344

三、N-gram标注

1、一元模型

from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
print(unigram_tagger.tag(brown_sents[2007]))

[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]

print(unigram_tagger.evaluate(brown_tagged_sents))

0.9349006503968017

2、分离训练与测试数据

# 分离训练与测试数据
size = int(len(brown_tagged_sents) * 0.9)
print(size)

4160

train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)

Out[29]: 0.8121200039868434

3、一般N-gram标注

# 一般N-gram标注
bigram_tagger = nltk.BigramTagger(train_sents)
bigram_tagger.tag(brown_sents[2007])

Out[30]: 
[('Various', 'JJ'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('apartments', 'NNS'),
 ('are', 'BER'),
 ('of', 'IN'),
 ('the', 'AT'),
 ('terrace', 'NN'),
 ('type', 'NN'),
 (',', ','),
 ('being', 'BEG'),
 ('on', 'IN'),
 ('the', 'AT'),
 ('ground', 'NN'),
 ('floor', 'NN'),
 ('so', 'CS'),
 ('that', 'CS'),
 ('entrance', 'NN'),
 ('is', 'BEZ'),
 ('direct', 'JJ'),
 ('.', '.')]

4、组合标注器

t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t2.evaluate(test_sents)

Out[31]: 0.8452108043456593

5、存储标注器

# 储存标注器
from pickle import dump
output = open('t2.pkl', 'wb')
dump(t2, output, -1)
output.close()
from pickle import load
input = open('t2.pkl', 'rb')
tagger = load(input)
input.close()
text = """The board's action shows what free enterprise
    is up against in our complex maze of regulatory laws ."""
tokens = text.split()
tagger.tag(tokens)

Out[34]: 
[('The', 'AT'),
 ("board's", 'NN$'),
 ('action', 'NN'),
 ('shows', 'NNS'),
 ('what', 'WDT'),
 ('free', 'JJ'),
 ('enterprise', 'NN'),
 ('is', 'BEZ'),
 ('up', 'RP'),
 ('against', 'IN'),
 ('in', 'IN'),
 ('our', 'PP$'),
 ('complex', 'JJ'),
 ('maze', 'NN'),
 ('of', 'IN'),
 ('regulatory', 'NN'),
 ('laws', 'NNS'),
 ('.', '.')]

6、跨句子边界标注器

# 跨句子边界标注
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t2.evaluate(test_sents)

Out[36]: 0.8452108043456593