NLTK Study Notes (3)

Categorizing and Tagging Words (tagging)

1. POS tagger

>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
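
NLTK can also tell you what each tag means. Assuming nltk is already imported (as in the snippet above) and the tagsets data package has been downloaded via nltk.download('tagsets'), you can query the tag documentation:

>>> nltk.help.upenn_tagset('RB')    # prints the definition and example words for the adverb tag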

2. Automatic tagging

(1) Find the most frequent tag in the Brown corpus

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
>>> nltk.FreqDist(tags).max()
u'NN'
(2) A default tagger and its performance (it tags every token as NN)

>>> raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
>>> tokens = nltk.word_tokenize(raw)
>>> default_tagger = nltk.DefaultTagger('NN')
>>> default_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'), ('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'), ('I', 'NN'), ('am', 'NN'), ('!', 'NN')]
>>> default_tagger.evaluate(brown_tagged_sents)
0.13089484257215028
(3) A regular-expression tagger and its performance (patterns are tried in order; the first one that matches wins)

>>> patterns = [
...     (r'.*ing$', 'VBG'),                # gerunds
...     (r'.*ed$', 'VBD'),                 # simple past
...     (r'.*es$', 'VBZ'),                 # 3rd singular present
...     (r'.*ould$', 'MD'),                # modals
...     (r'.*\'s$', 'NN$'),                # possessive nouns
...     (r'.*s$', 'NNS'),                  # plural nouns
...     (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers (dot escaped so it matches a literal '.')
...     (r'.*', 'NN')                      # nouns (default)
... ]
>>> regexp_tagger = nltk.RegexpTagger(patterns)
>>> regexp_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'), ('eggs', 'NNS'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'), ('I', 'NN'), ('am', 'NN'), ('!', 'NN')]
>>> regexp_tagger.evaluate(brown_tagged_sents)
0.20326391789486245


(4) A lookup tagger and its performance

(A lot of high-frequency words do not have the NN tag. Let's find the hundred most frequent words and store their most likely tag. We can then use this information as the model for a "lookup tagger" (an NLTK UnigramTagger).)

>>> fd = nltk.FreqDist(brown.words(categories='news'))
>>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
>>> most_freq_words = fd.keys()[:100]
>>> likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)
>>> baseline_tagger.evaluate(brown_tagged_sents)
0.45578495136941344

(Note: 0.4558 is the book's result; my own run of evaluate gave a pitifully low 0.005171350717027666. I guessed the local Brown data was incompletely downloaded, but a more likely culprit is that in NLTK 3 FreqDist.keys() is no longer sorted by frequency, so fd.keys()[:100] does not actually select the hundred most frequent words.)
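
If that is the cause, NLTK 3's FreqDist.most_common(), which does return items sorted by frequency, should restore a number close to the book's; a minimal sketch:

>>> most_freq_words = [w for (w, _) in fd.most_common(100)]
>>> likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)
>>> baseline_tagger.evaluate(brown_tagged_sents)   # should be close to the book's 0.4558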

Now let's look at its output:

>>> sent = brown.sents(categories='news')[3]
>>> baseline_tagger.tag(sent)
[('``', '``'), ('Only', None), ('a', 'AT'), ('relative', None), ('handful', None), ('of', 'IN'), ('such', None), ('reports', None),
('was', 'BEDZ'), ('received', None), ("''", "''"), (',', ','), ('the', 'AT'), ('jury', None), ('said', 'VBD'), (',', ','),
('``', '``'), ('considering', None), ('the', 'AT'), ('widespread', None), ('interest', None), ('in', 'IN'), ('the', 'AT'), ('election', None),
(',', ','), ('the', 'AT'), ('number', None), ('of', 'IN'), ('voters', None), ('and', 'CC'), ('the', 'AT'), ('size', None), ('of', 'IN'), ('this', 'DT'), ('city', None), ("''", "''"), ('.', '.')]

There are too many Nones, so we add a backoff:

>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags,
... backoff=nltk.DefaultTagger('NN'))
>>> baseline_tagger.tag(sent)

Now every None is replaced with 'NN'.
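
The book goes on to measure how the lookup tagger's accuracy grows with the size of its model. Its helper is roughly the following (reusing fd and cfd from above; exact scores depend on your NLTK version):

>>> def performance(cfd, wordlist):
...     lt = dict((word, cfd[word].max()) for word in wordlist)
...     baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
...     return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))
...
>>> words_by_freq = [w for (w, _) in fd.most_common()]   # all words, most frequent first
>>> performance(cfd, words_by_freq[:1000])               # a 1000-word model does much better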

(5) Evaluation

Quoting the book directly:

Since we don't usually have access to an expert and impartial human judge, we make do instead with gold standard test data: a corpus which has been manually annotated and which is accepted as a standard against which the guesses of an automatic system are assessed. The tagger is regarded as being correct if the tag it guesses for a given word is the same as the gold standard tag.
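
Concretely, evaluate() just compares the tagger's guesses with the gold-standard tags token by token. A hand-rolled sketch of the same computation for the default tagger above (my own illustration, not code from the book):

>>> gold = [tag for sent in brown_tagged_sents for (word, tag) in sent]
>>> guessed = [tag for sent in brown_tagged_sents
...            for (word, tag) in default_tagger.tag([w for (w, t) in sent])]
>>> sum(g == p for g, p in zip(gold, guessed)) / float(len(gold))   # matches evaluate(), about 0.13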


(6) N-gram tagging

Unigram

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')
>>> unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
>>> unigram_tagger.tag(brown_sents[2007])
[(u'Various', u'JJ'), (u'of', u'IN'), (u'the', u'AT'), (u'apartments', u'NNS'), (u'are', u'BER'), (u'of', u'IN'), (u'the', u'AT'), (u'terrace', u'NN'), (u'type', u'NN'), (u',', u','), (u'being', u'BEG'), (u'on', u'IN'), (u'the', u'AT'), (u'ground', u'NN'), (u'floor', u'NN'), (u'so', u'QL'), (u'that', u'CS'), (u'entrance', u'NN'), (u'is', u'BEZ'), (u'direct', u'JJ'), (u'.', u'.')]
>>> unigram_tagger.evaluate(brown_tagged_sents)
0.9349006503968017
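
Note that this evaluates the unigram tagger on the very sentences it was trained on, which inflates the score. With a held-out split (the same 90/10 split the bigram section below uses), the book reports roughly 0.81:

>>> size = int(len(brown_tagged_sents) * 0.9)
>>> train_sents, test_sents = brown_tagged_sents[:size], brown_tagged_sents[size:]
>>> nltk.UnigramTagger(train_sents).evaluate(test_sents)   # roughly 0.81 in the book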


Bigram

>>> size = int(len(brown_tagged_sents) * 0.9)
>>> size
4160
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]
>>> bigram_tagger = nltk.BigramTagger(train_sents)
>>> bigram_tagger.tag(brown_sents[2007])
[(u'Various', u'JJ'), (u'of', u'IN'), (u'the', u'AT'), (u'apartments', u'NNS'), (u'are', u'BER'), (u'of', u'IN'), (u'the', u'AT'), (u'terrace', u'NN'), (u'type', u'NN'), (u',', u','), (u'being', u'BEG'), (u'on', u'IN'), (u'the', u'AT'), (u'ground', u'NN'), (u'floor', u'NN'), (u'so', u'CS'), (u'that', u'CS'), (u'entrance', u'NN'), (u'is', u'BEZ'), (u'direct', u'JJ'), (u'.', u'.')]
>>> bigram_tagger.evaluate(brown_tagged_sents)
0.7182210553533425


>>> unseen_sent = brown_sents[4203]
>>> bigram_tagger.tag(unseen_sent)
[(u'The', u'AT'), (u'population', u'NN'), (u'of', u'IN'), (u'the', u'AT'), (u'Congo', u'NP'), (u'is', u'BEZ'), (u'13.5', None), (u'million', None), (u',', None), (u'divided', None), (u'into', None), (u'at', None), (u'least', None), (u'seven', None), (u'major', None), (u'``', None), (u'culture', None), (u'clusters', None), (u"''", None), (u'and', None), (u'innumerable', None), (u'tribes', None), (u'speaking', None), (u'400', None), (u'separate', None), (u'dialects', None), (u'.', None)]

Notice that the bigram tagger manages to tag every word in a sentence it saw during training, but does badly on an unseen sentence. As soon as it encounters a new word (i.e., 13.5), it is unable to assign a tag. It cannot tag the following word (i.e., million), even if it was seen during training, simply because it never saw it during training with a None tag on the previous word. Consequently, the tagger fails to tag the rest of the sentence. Its overall accuracy score is very low:

>>> bigram_tagger.evaluate(test_sents)
0.10276088906608193

As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem, and is quite pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of our results (this is related to the precision/recall trade-off in information retrieval). A concrete sketch follows.
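
A standalone trigram tagger makes contexts even more specific, so even fewer test contexts were seen in training, and the held-out score typically drops further (a sketch; exact numbers will vary):

>>> trigram_tagger = nltk.TrigramTagger(train_sents)
>>> trigram_tagger.evaluate(test_sents)   # typically even lower than the bigram tagger's 0.10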


Combining Taggers

>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
>>> t2.evaluate(test_sents)
0.844911791089405
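
The chain extends naturally with a trigram level on top, an exercise the book suggests; whether the extra level helps depends on the data:

>>> t3 = nltk.TrigramTagger(train_sents, backoff=t2)
>>> t3.evaluate(test_sents)   # in the book this nudges the score slightly above t2's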

Tagging Unknown Words

Our approach to tagging unknown words still uses backoff to a regular expression tagger or a default tagger. These are unable to make use of context. Thus, if our tagger encountered the word blog, not seen during training, it would assign it the same tag, regardless of whether this word appeared in the context the blog or to blog. How can we do better with these unknown words, or out-of-vocabulary items?

A useful method to tag unknown words based on context is to limit the vocabulary of a tagger to the most frequent n words, and to replace every other word with a special word UNK using the method shown in Section 5.3. During training, a unigram tagger will probably learn that UNK is usually a noun. However, the n-gram taggers will detect contexts in which it has some other tag. For example, if the preceding word is to (tagged TO), then UNK will probably be tagged as a verb. A sketch of this idea follows.
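
The book doesn't show code for this step here, so the following is only a sketch of the UNK idea under my own assumptions (a 1,000-word vocabulary, NLTK 3's FreqDist.most_common; the names unkify/u1/u2 are mine):

>>> vocab = set(w for (w, _) in nltk.FreqDist(brown.words(categories='news')).most_common(1000))
>>> def unkify(tagged_sents):
...     return [[(w if w in vocab else 'UNK', t) for (w, t) in sent] for sent in tagged_sents]
...
>>> u1 = nltk.UnigramTagger(unkify(train_sents))             # learns that UNK is usually a noun
>>> u2 = nltk.BigramTagger(unkify(train_sents), backoff=u1)  # learns contexts like TO + UNK
>>> u2.evaluate(unkify(test_sents))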






