Categorizing and Tagging Words (tagging)
1. POS tagger
>>> import nltk
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
2. Automatic tagging
(1) Find the most frequent tag in the Brown corpus
>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
>>> nltk.FreqDist(tags).max()
u'NN'
(2) The default tagger and its performance (tags every token as NN)
>>> raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
>>> tokens = nltk.word_tokenize(raw)
>>> default_tagger = nltk.DefaultTagger('NN')
>>> default_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'), ('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'), ('I', 'NN'), ('am', 'NN'), ('!', 'NN')]
>>> default_tagger.evaluate(brown_tagged_sents)
0.13089484257215028
(3) The regular-expression tagger and its performance
>>> patterns=[
... (r'.*ing$','VBG'),
... (r'.*ed$', 'VBD'), # simple past
... (r'.*es$', 'VBZ'), # 3rd singular present
... (r'.*ould$', 'MD'), # modals
... (r'.*\'s$', 'NN$'), # possessive nouns
... (r'.*s$', 'NNS'), # plural nouns
... (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers (the dot must be escaped, or e.g. '1a5' would match)
... (r'.*', 'NN') # nouns (default)
... ]
>>> regexp_tagger = nltk.RegexpTagger(patterns)
>>> regexp_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'), ('eggs', 'NNS'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'), ('I', 'NN'), ('am', 'NN'), ('!', 'NN')]
>>> regexp_tagger.evaluate(brown_tagged_sents)
0.20326391789486245
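The first-match-wins behaviour of RegexpTagger can be sketched with the stdlib re module alone. This is a toy re-implementation, not NLTK's actual code; it also shows why the dot in the cardinal-number pattern should be escaped:

```python
import re

# Patterns are tried in order; the first one that matches wins.
patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers (escaped dot)
    (r'.*', 'NN'),                     # default: noun
]

def regexp_tag(word):
    for pattern, tag in patterns:
        if re.match(pattern, word):
            return tag

print([(w, regexp_tag(w)) for w in ['running', 'walked', '3.5', 'eggs', 'ham']])
# [('running', 'VBG'), ('walked', 'VBD'), ('3.5', 'CD'), ('eggs', 'NNS'), ('ham', 'NN')]
```

With the escaped dot, a string like '1a5' falls through to the NN default instead of being mistaken for a number.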
(4) The lookup tagger and its performance
(A lot of high-frequency words do not have the NN tag. Let's find the hundred most frequent words and store their most likely tag. We can then use this information as the model for a "lookup tagger" (an NLTK UnigramTagger).)
>>> fd = nltk.FreqDist(brown.words(categories='news'))
>>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
>>> most_freq_words = fd.keys()[:100]
>>> likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)
>>> baseline_tagger.evaluate(brown_tagged_sents)
0.45578495136941344
(Note: that is the result from the book; my own run of evaluate gave a pitifully low 0.005171350717027666. The likely cause is an NLTK version difference rather than an incomplete download of brown.tagged_words: in NLTK 3, FreqDist.keys() no longer returns words in frequency order, so fd.keys()[:100] picks 100 essentially arbitrary words instead of the 100 most frequent ones.)
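The lookup-model idea itself is easy to reproduce with the stdlib alone. This toy sketch (made-up data, not the Brown corpus) uses Counter.most_common(), which is guaranteed frequency-ordered, unlike FreqDist.keys() in NLTK 3:

```python
from collections import Counter, defaultdict

# Toy tagged corpus standing in for brown.tagged_words(categories='news')
tagged = [('the', 'AT'), ('fox', 'NN'), ('the', 'AT'),
          ('runs', 'VBZ'), ('fox', 'NN'), ('fox', 'VB')]

fd = Counter(w for w, _ in tagged)   # word frequencies
cfd = defaultdict(Counter)           # per-word tag frequencies
for w, t in tagged:
    cfd[w][t] += 1

# most_common() is frequency-ordered; slicing keys() is not (in NLTK 3 / Python 3)
most_freq_words = [w for w, _ in fd.most_common(2)]
likely_tags = {w: cfd[w].most_common(1)[0][0] for w in most_freq_words}
print(likely_tags)  # {'fox': 'NN', 'the': 'AT'}
```

With real NLTK, the equivalent fix is to build likely_tags from fd.most_common(100) instead of fd.keys()[:100].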
Now look at the output:
>>> sent = brown.sents(categories='news')[3]
>>> baseline_tagger.tag(sent)
[('``', '``'), ('Only', None), ('a', 'AT'), ('relative', None), ('handful', None), ('of', 'IN'), ('such', None), ('reports', None), ('was', 'BEDZ'), ('received', None), ("''", "''"), (',', ','), ('the', 'AT'), ('jury', None), ('said', 'VBD'), (',', ','), ('``', '``'), ('considering', None), ('the', 'AT'), ('widespread', None), ('interest', None), ('in', 'IN'), ('the', 'AT'), ('election', None), (',', ','), ('the', 'AT'), ('number', None), ('of', 'IN'), ('voters', None), ('and', 'CC'), ('the', 'AT'), ('size', None), ('of', 'IN'), ('this', 'DT'), ('city', None), ("''", "''"), ('.', '.')]
There are far too many None results, so we need to add a backoff tagger:
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags,
... backoff=nltk.DefaultTagger('NN'))
>>> baseline_tagger.tag(sent)
Now every None is replaced with 'NN'.
(5) Evaluation
Quoting the book directly:
Since it is usually hard to obtain expert, impartial human judgments, we use gold standard test data instead: a corpus that has been manually annotated and accepted as the standard against which an automatic system is evaluated. The tagger is counted as correct when its guess for a given word matches the gold standard tag.
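The scoring rule is simple to state in code. A minimal sketch with made-up gold and predicted tag lists (this is the idea behind evaluate(), not NLTK's implementation):

```python
# Token-level accuracy against a gold standard: a guess is correct
# when it matches the manually annotated tag for the same token.
gold = [('the', 'AT'), ('cat', 'NN'), ('sat', 'VBD'), ('.', '.')]
predicted = [('the', 'AT'), ('cat', 'NN'), ('sat', 'NN'), ('.', '.')]

correct = sum(1 for g, p in zip(gold, predicted) if g[1] == p[1])
accuracy = correct / len(gold)
print(accuracy)  # 3 of 4 tags agree -> 0.75
```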
(6) N-gram tagging
Unigram
>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')
>>> unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
>>> unigram_tagger.tag(brown_sents[2007])
[(u'Various', u'JJ'), (u'of', u'IN'), (u'the', u'AT'), (u'apartments', u'NNS'), (u'are', u'BER'), (u'of', u'IN'), (u'the', u'AT'), (u'terrace', u'NN'), (u'type', u'NN'), (u',', u','), (u'being', u'BEG'), (u'on', u'IN'), (u'the', u'AT'), (u'ground', u'NN'), (u'floor', u'NN'), (u'so', u'QL'), (u'that', u'CS'), (u'entrance', u'NN'), (u'is', u'BEZ'), (u'direct', u'JJ'), (u'.', u'.')]
>>> unigram_tagger.evaluate(brown_tagged_sents)
0.9349006503968017
>>> size = int(len(brown_tagged_sents) * 0.9)
>>> size
4160
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]
>>> bigram_tagger = nltk.BigramTagger(train_sents)
>>> bigram_tagger.tag(brown_sents[2007])
[(u'Various', u'JJ'), (u'of', u'IN'), (u'the', u'AT'), (u'apartments', u'NNS'), (u'are', u'BER'), (u'of', u'IN'), (u'the', u'AT'), (u'terrace', u'NN'), (u'type', u'NN'), (u',', u','), (u'being', u'BEG'), (u'on', u'IN'), (u'the', u'AT'), (u'ground', u'NN'), (u'floor', u'NN'), (u'so', u'CS'), (u'that', u'CS'), (u'entrance', u'NN'), (u'is', u'BEZ'), (u'direct', u'JJ'), (u'.', u'.')]
>>> bigram_tagger.evaluate(brown_tagged_sents)
0.7182210553533425
>>> unseen_sent = brown_sents[4203]
>>> bigram_tagger.tag(unseen_sent)
[(u'The', u'AT'), (u'population', u'NN'), (u'of', u'IN'), (u'the', u'AT'), (u'Congo', u'NP'), (u'is', u'BEZ'), (u'13.5', None), (u'million', None), (u',', None), (u'divided', None), (u'into', None), (u'at', None), (u'least', None), (u'seven', None), (u'major', None), (u'``', None), (u'culture', None), (u'clusters', None), (u"''", None), (u'and', None), (u'innumerable', None), (u'tribes', None), (u'speaking', None), (u'400', None), (u'separate', None), (u'dialects', None), (u'.', None)]
Notice that the bigram tagger manages to tag every word in a sentence it saw during training, but does badly on an unseen sentence. As soon as it encounters a new word (i.e., 13.5), it is unable to assign a tag. It cannot tag the following word (i.e., million), even if it was seen during training, simply because it never saw it during training with a None tag on the previous word. Consequently, the tagger fails to tag the rest of the sentence. Its overall accuracy score is very low:
>>> bigram_tagger.evaluate(test_sents)
0.10276088906608193
As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem, and is quite pervasive in NLP. Consequently, there is a trade-off between the accuracy and the coverage of our results (this is related to the precision/recall trade-off in information retrieval).
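The sparse data problem is easy to see even on a toy pair of sentences: as n grows, the fraction of test n-grams never seen in training rises quickly (made-up six-word sentences, stdlib only):

```python
def unseen_fraction(train, test, n):
    # Fraction of the test sentence's n-grams absent from the training sentence
    seen = {tuple(train[i:i + n]) for i in range(len(train) - n + 1)}
    test_grams = [tuple(test[i:i + n]) for i in range(len(test) - n + 1)]
    return sum(1 for g in test_grams if g not in seen) / len(test_grams)

train = "the cat sat on the mat".split()
test = "the dog sat on the mat".split()
for n in (1, 2, 3):
    print(n, unseen_fraction(train, test, n))
# 1 0.1666...  (only 'dog' is new)
# 2 0.4        (bigrams around 'dog' are new)
# 3 0.5        (half the trigrams are new)
```

One new word contaminates every n-gram that contains it, so coverage drops as n grows.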
Combining Taggers
>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
>>> t2.evaluate(test_sents)
0.844911791089405
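The backoff control flow behind the t2 -> t1 -> t0 chain can be sketched as a toy class (not NLTK's real tagger classes, just the delegation logic):

```python
class SimpleTagger:
    # Each tagger consults its own model first and defers to its
    # backoff tagger when the word is unknown.
    def __init__(self, model, default=None, backoff=None):
        self.model, self.default, self.backoff = model, default, backoff

    def tag_word(self, word):
        if word in self.model:
            return self.model[word]
        if self.backoff is not None:
            return self.backoff.tag_word(word)
        return self.default

t0 = SimpleTagger({}, default='NN')           # like DefaultTagger('NN')
t1 = SimpleTagger({'the': 'AT'}, backoff=t0)  # like a tiny unigram model
print([t1.tag_word(w) for w in ['the', 'blog']])  # ['AT', 'NN']
```

Known words get their model tag; everything else falls through the chain to the default.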
Tagging unknown words
Our approach to tagging unknown words still uses backoff to a regular-expression tagger or a default tagger. These are unable to make use of context. Thus, if our tagger encountered the word “blog”, not seen during training, it would assign it the same tag regardless of whether this word appeared in the context “the blog” or “to blog”. How can we do better with these unknown words, or out-of-vocabulary items?
A useful method to tag unknown words based on context is to limit the vocabulary of a tagger to the most frequent n words, and to replace every other word with a special word UNK using the method shown in Section 5.3. During training, a unigram tagger will probably learn that UNK is usually a noun. However, the n-gram taggers will detect contexts in which it has some other tag. For example, if the preceding word is to (tagged TO), then UNK will probably be tagged as a verb.
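The vocabulary-limiting step can be sketched with Counter (a stdlib toy on a made-up sentence, not the actual Section 5.3 code):

```python
from collections import Counter

def limit_vocab(tokens, n):
    # Keep the n most frequent words; map everything else to 'UNK'
    vocab = {w for w, _ in Counter(tokens).most_common(n)}
    return [w if w in vocab else 'UNK' for w in tokens]

tokens = "to blog is to write , and to write is to blog".split()
print(limit_vocab(tokens, 4))
```

After this replacement, an n-gram tagger is trained on the UNK-rewritten corpus, so it can learn context-dependent tags for unknown words (e.g. UNK after "to" behaves like a verb).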