Categorizing and Tagging Words

Latest recommended article published on 2025-03-26 21:02:05

昂咕咕

Tags: python, artificial intelligence, natural language processing

Article link: https://blog.csdn.net/weixin_50397333/article/details/115607834

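The process of classifying words into their parts of speech (POS) and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset. Our emphasis here is on exploiting tags and on tagging text automatically.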

Using a Tagger

A part-of-speech tagger, or POS tagger, processes a sequence of words and attaches a part-of-speech tag to each word (remember to import nltk first):

>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]

Here we see that And is CC, a coordinating conjunction; now and completely are RB, adverbs; for is IN, a preposition; something is NN, a noun; and different is JJ, an adjective.

>>> text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]

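Notice that refuse and permit each appear both as a verb and as a noun (NN). In their verb uses, refuse is tagged VBP (present-tense verb) and permit is tagged VB (base-form verb).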

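Lexical categories such as "noun" and part-of-speech tags such as NN seem to have their uses, but the details will be obscure to many readers. You might wonder why this extra information is introduced at all. Many of these categories arise from a superficial analysis of the distribution of words in text. Consider the following analysis involving woman (a noun), bought (a verb), over (a preposition), and the (a determiner). The text.similar() method takes a word w, finds all contexts w1 w w2, and then finds all words w' that appear in the same context, that is, w1 w' w2.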

>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
>>> text.similar('woman')
Building word-context index...
man time day year car moment world family house country child boy
state job way war girl place room word
>>> text.similar('bought')
made said put done seen had found left given heard brought got been
was set told took in felt that
>>> text.similar('over')
in on to of and for with from at by that into as up out down through
is all about
>>> text.similar('the')
a his this their its her an that our any all one these my in your no
some other and

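Observe that searching for woman finds nouns; searching for bought mostly finds verbs; searching for over generally finds prepositions; and searching for the finds several determiners.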
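The idea behind this kind of lookup can be sketched in plain Python. This is a minimal illustration, not NLTK's implementation: the function name similar_words, the toy sentence, and the scoring (number of shared contexts) are all assumptions made for the example.

```python
from collections import defaultdict

def similar_words(tokens, target):
    """Rank words that share (previous word, next word) contexts with target."""
    contexts = defaultdict(set)                  # word -> set of (w1, w2) contexts
    for w1, w, w2 in zip(tokens, tokens[1:], tokens[2:]):
        contexts[w].add((w1, w2))
    target_contexts = contexts.get(target, set())
    scores = {}
    for word, ctxs in contexts.items():
        shared = len(ctxs & target_contexts)     # contexts seen with both words
        if word != target and shared:
            scores[word] = shared
    # Most shared contexts first; ties broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))

tokens = ("the woman bought a car . the man bought a house . "
          "the girl sold a car .").split()
print(similar_words(tokens, "woman"))            # ['man']
```

Here man is returned because both woman and man occur between the and bought, which is the sense in which words of the same category tend to turn up in the same contexts.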

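Tagged Corpora

Representing Tagged Tokens

By convention in NLTK, a tagged token is represented as a tuple consisting of the token and the tag. We can create one of these tuples from the standard string representation of a tagged token using the function str2tuple():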

>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'

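We can construct a list of tagged tokens directly from a string. The first step is to tokenize the string so that we can access the individual word/tag strings, and then convert each of these into a tuple (using str2tuple()).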

>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
... Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS
... said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'),
('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]
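To make the conversion concrete, here is a small stand-alone sketch of the same word/tag splitting. It is only modeled loosely on str2tuple(): the name split_tagged_token is hypothetical, and the choices to split at the last separator and to upper-case the tag are assumptions of this sketch.

```python
def split_tagged_token(s, sep="/"):
    """Split 'word/TAG' at the last separator; return (word, TAG)."""
    word, found, tag = s.rpartition(sep)
    if not found:                 # no separator: treat the token as untagged
        return (s, None)
    return (word, tag.upper())

sent = "The/AT grand/JJ jury/NN commented/VBD ./."
print([split_tagged_token(t) for t in sent.split()])
# [('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('.', '.')]
```

Splitting at the last separator keeps tokens such as 1/2/CD intact as the word 1/2 with tag CD.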

Reading Tagged Corpora

Several of the corpora included with NLTK have been tagged for their part of speech. Here is an example of what you might see if you opened a file from the Brown Corpus in a text editor:
The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at
evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.

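Other corpora use a variety of formats for storing part-of-speech tags. NLTK's corpus readers provide a uniform interface so that you don't have to be concerned with the different file formats.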

>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]
>>> # simplify_tags is the NLTK 2.x API; NLTK 3 replaced it with tagset='universal'
>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'N'), ('County', 'N'), ...]

Whenever a corpus contains tagged text, the NLTK corpus interface will have a tagged_words() method.

>>> print(nltk.corpus.nps_chat.tagged_words())
[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]
>>> nltk.corpus.conll2000.tagged_words()
[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]
>>> nltk.corpus.treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]

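Not all corpora employ the same set of tags. To avoid the complications of these tagsets, we use a built-in mapping to a simplified tagset (this is the NLTK 2.x interface; NLTK 3 offers tagset='universal' instead, with different tag names such as NOUN and VERB):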

>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...]
>>> nltk.corpus.treebank.tagged_words(simplify_tags=True)
[('Pierre', 'NP'), ('Vinken', 'NP'), (',', ','), ...]

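If the corpus is also segmented into sentences, it will have a tagged_sents() method that divides up the tagged words into sentences rather than presenting them as one big list.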

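A Simplified Part-of-Speech Tagset

[Image missing from the source: table of the simplified part-of-speech tagset]

Nouns

The simplified noun tags are N for common nouns such as book, and NP for proper nouns such as Scotland.
Nouns can appear after determiners and adjectives, and can be the subject or object of a verb.

Verbs

Verbs are words that describe events and actions.

Adjectives and Adverbs

Adjectives modify nouns and can be used as modifiers. Adverbs modify verbs, and can also modify adjectives.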

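Mapping Words to Properties Using Python Dictionaries

The most natural way to store mappings in Python is the dictionary data type.

Indexing Lists vs. Dictionaries

A text is treated in Python as a list of words, and we can look up a particular item by its index.
With a frequency distribution we instead specify a word and get back a number, for example fdist['monstrous'], which tells us how many times a given word occurs in the text. Looking something up by word is like using a dictionary.

Linguistic objects as mappings from keys to values
[Image missing from the source]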

Dictionaries in Python

>>> pos = {}
>>> pos
{}
>>> pos['colorless'] = 'ADJ' 
>>> pos
{'colorless': 'ADJ'}
>>> pos['ideas'] = 'N'
>>> pos['sleep'] = 'V'
>>> pos['furiously'] = 'ADV'
>>> pos 
{'furiously': 'ADV', 'ideas': 'N', 'colorless': 'ADJ', 'sleep': 'V'}

We use a key to retrieve its value:

>>> pos['ideas']
'N'
>>> pos['colorless']
'ADJ'

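How do we work out the legal keys for a dictionary? If the dictionary is not too big, we can simply inspect its contents by evaluating the variable pos.

Alternatively, to find just the keys, we can convert the dictionary to a list ①, or use the dictionary in a context where a list is expected, such as the argument of sorted() ② or in a for loop ③.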

>>> list(pos)  # ①
['ideas', 'furiously', 'colorless', 'sleep']
>>> sorted(pos)  # ②
['colorless', 'furiously', 'ideas', 'sleep']
>>> [w for w in pos if w.endswith('s')]  # ③
['colorless', 'ideas']

Just as we can iterate over all the keys of a dictionary with a for loop, we can use a for loop to print the dictionary's contents:

>>> for word in sorted(pos):
...     print(word + ":", pos[word])
...
colorless: ADJ
furiously: ADV
ideas: N
sleep: V

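Finally, the dictionary methods keys(), values(), and items() give us access to the keys, the values, and the key-value pairs. We can even sort the pairs, which are tuples, by their first element (and by their second element if the first elements are equal).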
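A short sketch of these three methods, using the same four-word pos dictionary as above. In Python 3 the methods return view objects rather than lists, so the example passes them through sorted() to get lists back.

```python
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}

keys = sorted(pos.keys())      # ['colorless', 'furiously', 'ideas', 'sleep']
values = sorted(pos.values())  # ['ADJ', 'ADV', 'N', 'V']
pairs = sorted(pos.items())    # tuples compared by key first, then by value

for key, val in pairs:
    print(key + ":", val)
```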

Note: a key can have only one value.

>>> pos['sleep'] = 'V'
>>> pos['sleep']
'V'
>>> pos['sleep'] = 'N'
>>> pos['sleep']
'N'

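Initially, pos['sleep'] is given the value 'V', but this is immediately overwritten with the new value 'N'. In other words, there can be only one entry in the dictionary for 'sleep'. However, there is a way of storing multiple values in that entry: we use a list value, for example pos['sleep'] = ['N', 'V'].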
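One way to build such list values step by step is setdefault(), sketched below. The word/tag pairs are made up for the illustration.

```python
pos = {}
observations = [('sleep', 'V'), ('sleep', 'N'), ('ideas', 'N')]

for word, tag in observations:
    # setdefault returns the existing list, or stores and returns a new empty one
    pos.setdefault(word, []).append(tag)

print(pos['sleep'])   # ['V', 'N']
```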

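Defining Dictionaries

We can create a dictionary using the key-value pair format. There are two ways to do this, and we will normally use the first: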

>>> pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
>>> pos = dict(colorless='ADJ', ideas='N', sleep='V', furiously='ADV')

Note that dictionary keys must be immutable types, such as strings and tuples. If we try to define a dictionary using a mutable key, we get a TypeError:

>>> pos = {['ideas', 'blogs', 'adventures']: 'N'}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
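If a group of words really is needed as a key, converting the list to a tuple is one way around the error. The sketch below also shows the more common alternative of one entry per word; the variable names are illustrative only.

```python
words = ['ideas', 'blogs', 'adventures']

try:
    pos = {words: 'N'}             # a list is mutable, so it cannot be hashed
    list_key_ok = True
except TypeError:
    list_key_ok = False

pos = {tuple(words): 'N'}          # a tuple with the same contents works as a key
pos_by_word = {w: 'N' for w in words}   # or store one entry per word

print(list_key_ok)                 # False
print(pos[('ideas', 'blogs', 'adventures')])
```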

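Default Dictionaries

If we try to access a key that is not in a dictionary, we get an error. However, it is often useful if a dictionary can automatically create an entry for the new key and give it a default value, such as zero or the empty list. Since Python 2.5 a special kind of dictionary called defaultdict has been available in the standard collections module. (NLTK 2.x also provided it as nltk.defaultdict for readers still on Python 2.4.) To use it, we have to supply a parameter that is used to create the default value, for example int, float, str, list, dict, or tuple.

>>> from collections import defaultdict
>>> frequency = defaultdict(int)
>>> frequency['colorless'] = 4
>>> frequency['ideas']
0
>>> pos = defaultdict(list)
>>> pos['sleep'] = ['N', 'V']
>>> pos['ideas']
[]

These default values are actually functions that convert other objects to the specified type (for example int("2"), list("2")). When they are called with no arguments, as in int() and list(), they return 0 and [] respectively.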

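However, we can also specify any default value we like, simply by providing the name of a function that can be called with no arguments to create the required value. Let's return to our part-of-speech example and create a dictionary whose default value for any entry is 'N' ①. When we access a non-existent entry ②, it is automatically added to the dictionary ③.

>>> from collections import defaultdict
>>> pos = defaultdict(lambda: 'N')  # ①
>>> pos['colorless'] = 'ADJ'
>>> pos['blog']  # ②
'N'
>>> list(pos.items())  # ③
[('colorless', 'ADJ'), ('blog', 'N')]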

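This example uses a lambda expression. The lambda expression specifies no parameters, so we call it using parentheses with no arguments. Thus, the definitions of f and g below are equivalent:

>>> f = lambda: 'N'
>>> f()
'N'
>>> def g():
...     return 'N'
...
>>> g()
'N'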

Incrementally Updating a Dictionary

Here we update a dictionary incrementally and then sort it by value.

>>> from collections import defaultdict
>>> counts = defaultdict(int)
>>> from nltk.corpus import brown
>>> # simplify_tags=True is the NLTK 2.x API (NLTK 3: tagset='universal', with different tag names)
>>> for (word, tag) in brown.tagged_words(categories='news', simplify_tags=True):
...     counts[tag] += 1
...
>>> counts['N']
22226
>>> list(counts)
['FW', 'DET', 'WH', "''", 'VBZ', 'VB+PPO', "'", ')', 'ADJ', 'PRO', '*', '-', ...]
>>> from operator import itemgetter
>>> sorted(counts.items(), key=itemgetter(1), reverse=True)
[('N', 22226), ('P', 10845), ('DET', 10648), ('NP', 8336), ('V', 7313), ...]
>>> [t for t, c in sorted(counts.items(), key=itemgetter(1), reverse=True)]
['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', '.', 'CNJ', 'PRO', 'ADV', 'VD', ...]

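This example demonstrates an important idiom for sorting a dictionary by its values, here showing the tags in decreasing order of frequency. The first argument of sorted() is the items to sort, a list of tuples each consisting of a POS tag and a frequency. The second argument specifies the sort key using the function itemgetter(). In general, itemgetter(n) returns a function that can be called on some other sequence object to obtain the element at index n of that sequence.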

>>> pair = ('NP', 8336)
>>> pair[1]
8336
>>> itemgetter(1)(pair)
8336

The last argument of sorted() specifies whether the items should be returned in reverse order, that is, in decreasing order of frequency.
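The same ranking can also be obtained with collections.Counter from the standard library, which is a different tool from the defaultdict used above. The sketch below compares the two on a made-up list of tags; it is an illustration, not a replacement for the corpus example.

```python
from collections import Counter
from operator import itemgetter

tags = ['DET', 'N', 'V', 'DET', 'N', 'DET']   # toy data, not from the Brown Corpus

counts = Counter(tags)                        # tag -> frequency
by_value = sorted(counts.items(), key=itemgetter(1), reverse=True)

print(by_value)                 # [('DET', 3), ('N', 2), ('V', 1)]
print(counts.most_common())     # same ranking, built into Counter
```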

Python dictionary methods: a summary of commonly used methods and idioms involving dictionaries

[Image missing from the source]
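Since the summary table itself is not available here, the following sketch lists a few standard dictionary operations. It is not a reconstruction of the missing table, only a selection of built-in dict features applied to made-up entries.

```python
d = {}                             # create an empty dictionary
d['ideas'] = 'N'                   # assign a value to a key
d.update({'sleep': 'V'})           # merge in entries from another dictionary

has_ideas = 'ideas' in d           # membership test on keys
fallback = d.get('blog', 'N')      # lookup with a fallback; does not add the key
ordered_keys = sorted(d)           # sorted list of keys
size_before = len(d)               # number of entries

del d['sleep']                     # remove an entry
for key, value in d.items():       # iterate over key-value pairs
    print(key, value)
```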

Automatic Tagging

We begin by loading the data we will use:

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')

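The Default Tagger

First we find the most frequent tag (from here on we use the unsimplified tagset):

>>> tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
>>> nltk.FreqDist(tags).max()
'NN'

Now we can create a tagger that tags everything as NN.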

>>> raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
>>> tokens = nltk.word_tokenize(raw)
>>> default_tagger = nltk.DefaultTagger('NN')
>>> default_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'),
('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'),
('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'),
('I', 'NN'), ('am', 'NN'), ('!', 'NN')]

Unsurprisingly, this method performs rather poorly. On a typical corpus, it tags only about an eighth of the tokens correctly, as we see here:

>>> default_tagger.evaluate(brown_tagged_sents)
0.13089484257215028
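What the default tagger and its score amount to can be sketched without NLTK. The helper names (most_common_tag, default_tag, accuracy) and the two toy sentences are assumptions for this illustration; accuracy here simply means the fraction of tokens whose guessed tag equals the gold tag.

```python
from collections import Counter

def most_common_tag(tagged_sents):
    """Return the tag that occurs most often in the tagged sentences."""
    counts = Counter(tag for sent in tagged_sents for _, tag in sent)
    return counts.most_common(1)[0][0]

def default_tag(tokens, tag):
    """Assign the same tag to every token."""
    return [(tok, tag) for tok in tokens]

def accuracy(tag_fn, gold_sents):
    """Fraction of tokens for which tag_fn reproduces the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [w for w, _ in sent]
        for (_, guess), (_, gold) in zip(tag_fn(words), sent):
            correct += (guess == gold)
            total += 1
    return correct / total

gold = [
    [('the', 'AT'), ('jury', 'NN'), ('praised', 'VBD'), ('the', 'AT'), ('city', 'NN')],
    [('election', 'NN'), ('day', 'NN'), ('came', 'VBD')],
]
best = most_common_tag(gold)                               # 'NN'
score = accuracy(lambda words: default_tag(words, best), gold)
print(best, score)                                         # NN 0.5
```

On this toy data half the tokens are nouns, so the baseline scores 0.5; on real text the share of NN tokens is much smaller, which is why the score above is so low.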

The Regular Expression Tagger

The regular expression tagger assigns tags to tokens on the basis of matching patterns.

>>> patterns = [
... (r'.*ing$', 'VBG'), # gerunds
... (r'.*ed$', 'VBD'), # simple past
... (r'.*es$', 'VBZ'), # 3rd singular present
... (r'.*ould$', 'MD'), # modals
... (r'.*\'s$', 'NN$'), # possessive nouns
... (r'.*s$', 'NNS'), # plural nouns
... (r'^-?[0-9]+(.[0-9]+)?$', 'CD'), # cardinal numbers
... (r'.*', 'NN') # nouns (default)
... ]

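Note that these are processed in order, and the first one that matches is applied. Now we can set up a tagger and use it to tag a sentence. After this step it is right about a fifth of the time.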

>>> regexp_tagger = nltk.RegexpTagger(patterns)
>>> regexp_tagger.tag(brown_sents[3])
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'),
('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'),
("''", 'NN'), (',', 'NN'), ('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'),
('``', 'NN'), ('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'), ...]
>>> regexp_tagger.evaluate(brown_tagged_sents)
0.20326391789486245
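The first-match-wins behavior can be sketched with the standard re module. This is an illustration under simplifying assumptions, not NLTK's RegexpTagger: the function name regexp_tag is hypothetical and only four of the patterns are used.

```python
import re

patterns = [
    (r'.*ing$', 'VBG'),   # gerunds
    (r'.*ed$', 'VBD'),    # simple past
    (r'.*s$', 'NNS'),     # plural nouns
    (r'.*', 'NN'),        # default: everything else is a noun
]
compiled = [(re.compile(p), tag) for p, tag in patterns]

def regexp_tag(tokens):
    """Tag each token with the tag of the first pattern that matches it."""
    tagged = []
    for tok in tokens:
        for regex, tag in compiled:
            if regex.match(tok):
                tagged.append((tok, tag))
                break                  # later patterns are never consulted
    return tagged

print(regexp_tag(['reports', 'was', 'received', 'considering', 'jury']))
# [('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'),
#  ('considering', 'VBG'), ('jury', 'NN')]
```

The wrong tag for was shows the weakness of purely suffix-based rules: any word ending in s is treated as a plural noun.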

N-Gram Tagging

Unigram Tagging

Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')
>>> unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
>>> unigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'),
('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'),
(',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'),
('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'),
('direct', 'JJ'), ('.', '.')]
>>> unigram_tagger.evaluate(brown_tagged_sents)  # evaluate on the training data
0.9349006503968017
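The statistic behind a unigram tagger can be sketched in a few lines. The function names and the three training sentences below are made up for the illustration; a real tagger would be trained on a corpus such as the Brown news section.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    freq = defaultdict(Counter)            # word -> Counter of tags
    for sent in tagged_sents:
        for word, tag in sent:
            freq[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in freq.items()}

def unigram_tag(model, tokens):
    """Look each token up in the model; unseen words get None."""
    return [(tok, model.get(tok)) for tok in tokens]

train = [
    [('the', 'AT'), ('wind', 'NN'), ('blows', 'VBZ')],
    [('they', 'PPSS'), ('wind', 'VB'), ('the', 'AT'), ('clock', 'NN')],
    [('the', 'AT'), ('wind', 'NN'), ('dropped', 'VBD')],
]
model = train_unigram(train)
print(unigram_tag(model, ['the', 'wind', 'howls']))
# [('the', 'AT'), ('wind', 'NN'), ('howls', None)]
```

The word wind is tagged NN because it was a noun in two of its three training occurrences, regardless of the context it appears in.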

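Separating the Training and Testing Data

A tagger scored on the same data it was trained on gets an inflated result, so we should split the data, training on 90% and testing on the remaining 10%: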

>>> size = int(len(brown_tagged_sents) * 0.9)
>>> size
4160
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]
>>> unigram_tagger = nltk.UnigramTagger(train_sents)
>>> unigram_tagger.evaluate(test_sents)
0.81202033290142528
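The split itself is plain list slicing. In the sketch below, split_train_test is a hypothetical helper and the list of integers stands in for the tagged sentences; the length 4623 is inferred from the size of 4160 shown above, since int(4623 * 0.9) is 4160.

```python
def split_train_test(items, train_fraction=0.9):
    """Return (training part, test part), cutting at a fixed fraction."""
    size = int(len(items) * train_fraction)
    return items[:size], items[size:]

sents = list(range(4623))          # stand-in for brown_tagged_sents
train_part, test_part = split_train_test(sents)
print(len(train_part), len(test_part))   # 4160 463
```

Because the cut is positional, the test part is simply the last tenth of the data; no sentence ends up in both parts.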

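General N-Gram Tagging

An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens.

The NgramTagger class uses a tagged training corpus to determine which part-of-speech tag is most likely for each context. Here we see a special case of an n-gram tagger, namely a bigram tagger. First we train it, then use it to tag untagged sentences: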

>>> bigram_tagger = nltk.BigramTagger(train_sents)
>>> bigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'),
('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'),
('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'),
('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'),
('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]
>>> unseen_sent = brown_sents[4203]
>>> bigram_tagger.tag(unseen_sent)
[('The', 'AT'), ('population', 'NN'), ('of', 'IN'), ('the', 'AT'), ('Congo', 'NP'),
('is', 'BEZ'), ('13.5', None), ('million', None), (',', None), ('divided', None),
('into', None), ('at', None), ('least', None), ('seven', None), ('major', None),
('``', None), ('culture', None), ('clusters', None), ("''", None), ('and', None),
('innumerable', None), ('tribes', None), ('speaking', None), ('400', None),
('separate', None), ('dialects', None), ('.', None)]

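Note:
N-gram taggers should not consider context that crosses a sentence boundary. Accordingly, NLTK taggers are designed to work with lists of sentences, where each sentence is a list of words. At the start of a sentence, t_{n-1} and the preceding tags are set to None.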
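A small pure-Python sketch may help explain the run of None tags in the output above. It is an assumption-laden illustration, not NLTK's BigramTagger: the context is the pair (previous tag, current word), the previous tag is None at the start of a sentence, and the single training sentence is made up.

```python
from collections import Counter, defaultdict

def train_bigram(tagged_sents):
    """Map (previous tag, word) to the most frequent tag in that context."""
    freq = defaultdict(Counter)
    for sent in tagged_sents:
        prev = None                        # no tag precedes the first word
        for word, tag in sent:
            freq[(prev, word)][tag] += 1
            prev = tag
    return {ctx: tags.most_common(1)[0][0] for ctx, tags in freq.items()}

def bigram_tag(model, tokens):
    """Tag left to right; an unseen context yields None."""
    tagged, prev = [], None
    for tok in tokens:
        tag = model.get((prev, tok))
        tagged.append((tok, tag))
        prev = tag                         # a None here poisons the next context
    return tagged

train = [[('the', 'AT'), ('population', 'NN'), ('is', 'BEZ'), ('large', 'JJ')]]
model = train_bigram(train)
print(bigram_tag(model, ['the', 'population', 'is', '13.5', 'million']))
# [('the', 'AT'), ('population', 'NN'), ('is', 'BEZ'), ('13.5', None), ('million', None)]
```

Once one word cannot be tagged, the next word's context contains None as its previous tag, a context that was not seen in training either, so the failure carries on to the end of the sentence.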

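Combining Taggers

We can combine a bigram tagger, a unigram tagger, and a default tagger as follows:

1. Try tagging the token with the bigram tagger.
2. If the bigram tagger is unable to find a tag for the token, try the unigram tagger.
3. If the unigram tagger is also unable to find a tag, use the default tagger.

Most NLTK taggers permit a backoff tagger to be specified. The backoff tagger may itself have a backoff tagger: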

>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
>>> t2.evaluate(test_sents)
0.84491179108940495
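The backoff chain can be sketched as three lookups tried in order. Everything in this example is hypothetical: make_backoff_tagger is an illustrative helper, and the two small models are written by hand rather than trained.

```python
def make_backoff_tagger(bigram_model, unigram_model, default_tag):
    """Build a tagger that tries bigram, then unigram, then a default tag."""
    def tag(tokens):
        tagged, prev = [], None
        for tok in tokens:
            t = bigram_model.get((prev, tok))      # 1. most specific model
            if t is None:
                t = unigram_model.get(tok)         # 2. back off to the word alone
            if t is None:
                t = default_tag                    # 3. last resort
            tagged.append((tok, t))
            prev = t
        return tagged
    return tag

# Hand-written toy models, keyed as (previous tag, word) and word respectively.
bigram_model = {(None, 'the'): 'AT', ('AT', 'wind'): 'NN', ('TO', 'wind'): 'VB'}
unigram_model = {'the': 'AT', 'wind': 'NN', 'to': 'TO', 'is': 'BEZ'}

tag = make_backoff_tagger(bigram_model, unigram_model, 'NN')
print(tag(['to', 'wind', 'the', 'clock']))
# [('to', 'TO'), ('wind', 'VB'), ('the', 'AT'), ('clock', 'NN')]
```

Because every token now receives some tag, the previous-tag context is never None in mid-sentence, so the bigram model gets a chance again on later words.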

Storing Taggers

Let's save our tagger t2 to a file t2.pkl:

>>> from pickle import dump
>>> with open('t2.pkl', 'wb') as output:
...     dump(t2, output, -1)
...

Now, in a separate Python process, we can load our saved tagger:

>>> from pickle import load
>>> with open('t2.pkl', 'rb') as infile:
...     tagger = load(infile)
...
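The save-and-load round trip can be tried without a trained tagger. In the sketch below a plain dictionary stands in for the tagger, and a temporary directory is used so that no file is left behind in the working directory; both choices are assumptions of the example.

```python
import os
import pickle
import tempfile

model = {'the': 'AT', 'wind': 'NN'}        # stand-in for a trained tagger

path = os.path.join(tempfile.mkdtemp(), 't2.pkl')
with open(path, 'wb') as output:
    pickle.dump(model, output, pickle.HIGHEST_PROTOCOL)

with open(path, 'rb') as infile:
    restored = pickle.load(infile)

print(restored == model)                   # True
```

Pickle files can execute code when loaded, so only unpickle files that come from a source you trust.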

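Summary

Words can be grouped into classes, such as nouns, verbs, adjectives, and adverbs. These classes are known as lexical categories or parts of speech. Parts of speech are assigned short labels, or tags, such as NN and VB.
The process of automatically assigning parts of speech to words in text is called part-of-speech tagging, POS tagging, or just tagging.
Automatic tagging is an important step in the NLP pipeline, and is useful in a variety of situations, including predicting the behavior of previously unseen words, analyzing word usage in corpora, and text-to-speech systems.
Some linguistic corpora, such as the Brown Corpus, have been POS tagged.
A variety of tagging methods are possible, such as a default tagger, a regular expression tagger, a unigram tagger, and n-gram taggers. These can be combined using a technique known as backoff.
Taggers can be trained and evaluated using tagged corpora.
Backoff is a method for combining models: when a more specialized model (such as a bigram tagger) cannot assign a tag in a given context, we back off to a more general model (such as a unigram tagger).
Part-of-speech tagging is an important, early example of a sequence classification task in NLP: a classification decision at any one point in the sequence makes use of words and tags in the local context.
A dictionary is used to map between arbitrary types of information, such as a string and a number: freq['cat'] = 12. We create dictionaries using the brace notation: pos = {}, pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}.
N-gram taggers can be defined for large values of n, but once n is larger than 3 we usually encounter the sparse data problem; even with a large quantity of training data, we see only a tiny fraction of the possible contexts.
Transformation-based tagging involves learning a series of repair rules of the form "change tag s to tag t in context c", where each rule fixes mistakes and possibly introduces a (smaller) number of errors.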
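A single repair rule of that form can be sketched as follows. The rule, the tags, and the function name apply_rule are made-up illustrations; the context here is just the tag of the preceding token, and the learning step that selects useful rules is not shown.

```python
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the preceding token has prev_tag."""
    out = []
    for i, (word, tag) in enumerate(tagged):
        if tag == from_tag and i > 0 and tagged[i - 1][1] == prev_tag:
            tag = to_tag
        out.append((word, tag))
    return out

# An initial tagging that wrongly treats "increase" as a noun after "to".
tagged = [('to', 'TO'), ('increase', 'NN'), ('grants', 'NNS')]
print(apply_rule(tagged, 'NN', 'VB', 'TO'))
# [('to', 'TO'), ('increase', 'VB'), ('grants', 'NNS')]
```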