1. Document Classification
Step 1: Using the pre-categorized corpus, we build a list of the 2,000 most frequent words. We then define a feature extractor that simply checks whether each of these words is present in a given document.
>>> import random
>>> import nltk
>>> from nltk.corpus import movie_reviews
>>> documents = [(list(movie_reviews.words(fileid)), category)
...              for category in movie_reviews.categories()
...              for fileid in movie_reviews.fileids(category)]
>>> random.shuffle(documents)
>>> all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
>>> word_features = [w for (w, _) in all_words.most_common(2000)]
>>> def document_features(document):
... document_words = set(document)
... features = {}
... for word in word_features:
... features['contains(%s)' % word] = (word in document_words)
... return features
...
>>> print(document_features(movie_reviews.words('pos/cv957_8737.txt')))
{u'contains(corporate)': False, u'contains(barred)': False, u'contains(batmans)': False,...}
We compute the set of all words in a document, rather than just checking `word in document`, because checking whether a word occurs in a set is much faster than checking whether it occurs in a list.
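A quick, self-contained way to see the difference (the vocabulary size and lookup key here are arbitrary):

```python
import timeit

# Membership tests on a list scan element by element (O(n) worst case),
# while a set uses a hash table (O(1) on average).
words_list = [str(i) for i in range(20000)]
words_set = set(words_list)

# Look up a word near the end of the list, where the scan cost is highest.
t_list = timeit.timeit(lambda: '19999' in words_list, number=1000)
t_set = timeit.timeit(lambda: '19999' in words_set, number=1000)
print(t_set < t_list)  # True: the set lookup wins by a wide margin
```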
Step 2: Now that we have defined our feature extractor, we can use it to train a classifier to label new movie reviews. To check how reliable the resulting classifier is, we compute its accuracy on the test set, and then look at which features are the most informative. (Once again my result differs from the book's 0.81; my 0.65 lags far behind. Bad luck?)
>>> featuresets = [(document_features(d), c) for (d, c) in documents]
>>> train_set, test_set = featuresets[100:], featuresets[:100]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print(nltk.classify.accuracy(classifier, test_set))
0.65
>>> classifier.show_most_informative_features(5)
Most Informative Features
contains(sans) = True neg : pos = 8.3 : 1.0
contains(mediocrity) = True neg : pos = 7.6 : 1.0
contains(dismissed) = True pos : neg = 7.1 : 1.0
contains(overwhelmed) = True pos : neg = 6.4 : 1.0
contains(bruckheimer) = True neg : pos = 6.3 : 1.0
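The numbers on the right are likelihood ratios: the estimated probability of the feature in one class divided by its probability in the other. With invented counts (NLTK also applies smoothing, which is omitted here), a ratio like 8.3 : 1.0 falls out of a simple division:

```python
# Invented counts, purely for illustration: suppose 'contains(sans) = True'
# occurs in 25 of 900 negative reviews but only 3 of 900 positive ones.
neg_count, pos_count, n_neg, n_pos = 25, 3, 900, 900
ratio = (neg_count / n_neg) / (pos_count / n_pos)
print('neg : pos = %.1f : 1.0' % ratio)  # neg : pos = 8.3 : 1.0
```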
2. POS Tagging
Our next goal is to train a classifier that guesses a word's part of speech from its suffix. First, let's find the most common suffixes.
Then we will define a feature extractor function that checks a given word for these suffixes.
Finally, we will use our feature extractor to train a new decision tree classifier.
>>> from nltk.corpus import brown
>>> suffix_fdist = nltk.FreqDist()
>>> for word in brown.words():
... word = word.lower()
... suffix_fdist[word[-1:]]+=1
... suffix_fdist[word[-2:]]+=1
... suffix_fdist[word[-3:]]+=1
...
>>> common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
>>> print(common_suffixes)
[u'e', u',', u'.', u's', u'd', u't', u'he', u'n', u'a', u'of', u'the', u'y', u'r', u'to', u'in', u'f', u'o', u'ed', u'nd', u'is', u'on', u'l', u'g', u'and', u'ng', u'er', u'as', u'ing', u'h', u'at', u'es', u'or', u're', u'it', u'``', u'an', u"''", u'm', u';', u'i', u'ly', u'ion', u'en', u'al', u'?', u'nt', u'be', u'hat', u'st', u'his', u'th', u'll', u'le', u'ce', u'by', u'ts', u'me', u've', u"'", u'se', u'ut', u'was', u'for', u'ent', u'ch', u'k', u'w', u'ld', u'`', u'rs', u'ted', u'ere', u'her', u'ne', u'ns', u'ith', u'ad', u'ry', u')', u'(', u'te', u'--', u'ay', u'ty', u'ot', u'p', u'nce', u"'s", u'ter', u'om', u'ss', u':', u'we', u'are', u'c', u'ers', u'uld', u'had', u'so', u'ey']
>>> def pos_features(word):
... features = {}
... for suffix in common_suffixes:
... features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
... return features
...
>>> tagged_words = brown.tagged_words(categories='news')
>>> featuresets = [(pos_features(n), g) for (n,g) in tagged_words]
>>> size = int(len(featuresets) * 0.1)
>>> train_set, test_set = featuresets[size:], featuresets[:size]
>>> classifier = nltk.DecisionTreeClassifier.train(train_set)
>>> nltk.classify.accuracy(classifier, test_set)
0.62705121829935351
>>> classifier.classify(pos_features('cats'))
'NNS'
Note: I copied the last three lines from the book. My DecisionTreeClassifier run is taking far too long; I don't know whether it has finished yet, so I'll go check on it.
One nice property of decision tree models is that they are usually easy to interpret. We can even instruct NLTK to print them out as pseudocode:
>>> print(classifier.pseudocode(depth=4))
if endswith(,) == True: return ','
if endswith(,) == False:
  if endswith(the) == True: return 'AT'
  if endswith(the) == False:
    if endswith(s) == True:
      if endswith(is) == True: return 'BEZ'
      if endswith(is) == False: return 'VBZ'
    if endswith(s) == False:
      if endswith(.) == True: return '.'
      if endswith(.) == False: return 'NN'
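The pseudocode above translates directly into nested conditionals. A hand-coded Python equivalent of this depth-4 tree (tags taken from the output above; the real classifier is much deeper) would look like this:

```python
def classify_by_suffix(word):
    """Depth-4 suffix decision tree, hand-coded from the pseudocode above."""
    w = word.lower()
    if w.endswith(','):
        return ','        # comma
    if w.endswith('the'):
        return 'AT'       # article
    if w.endswith('s'):
        if w.endswith('is'):
            return 'BEZ'  # the verb "is"
        return 'VBZ'      # e.g. 3rd-person singular verbs
    if w.endswith('.'):
        return '.'        # sentence-final punctuation
    return 'NN'           # default: singular noun

print(classify_by_suffix('the'))   # AT
print(classify_by_suffix('is'))    # BEZ
print(classify_by_suffix('runs'))  # VBZ
```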