1. Document Classification
Step 1: Using the pre-categorized corpus, we build a list of the 2,000 most frequent words. We then define a feature extractor that simply checks whether each of these words is present in a given document.
>>> import random
>>> import nltk
>>> from nltk.corpus import movie_reviews
>>> documents = [(list(movie_reviews.words(fileid)), category)
...              for category in movie_reviews.categories()
...              for fileid in movie_reviews.fileids(category)]
>>> random.shuffle(documents)
>>> all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
>>> word_features = [w for (w, _) in all_words.most_common(2000)]
>>> def document_features(document):
... document_words = set(document)
... features = {}
... for word in word_features:
... features['contains(%s)' % word] = (word in document_words)
... return features
...
>>> print(document_features(movie_reviews.words('pos/cv957_8737.txt')))
{u'contains(corporate)': False, u'contains(barred)': False, u'contains(batmans)': False,...}
We compute the set of all words in a document, rather than just checking `word in document`, because checking whether a word occurs in a set is much faster than checking whether it occurs in a list.
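A quick, self-contained way to see the difference (the vocabulary size and lookup key here are arbitrary):

```python
import timeit

# Membership tests on a list scan element by element (O(n) worst case),
# while a set uses a hash table (O(1) on average).
words_list = [str(i) for i in range(20000)]
words_set = set(words_list)

# Look up a word near the end of the list, where the scan cost is highest.
t_list = timeit.timeit(lambda: '19999' in words_list, number=1000)
t_set = timeit.timeit(lambda: '19999' in words_set, number=1000)
print(t_set < t_list)  # True: the set lookup wins by a wide margin
```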
Step 2: Now that we have defined our feature extractor, we can use it to train a classifier to label new movie reviews. To check how reliable the resulting classifier is, we compute its accuracy on the test set, and then look at which features are the most informative. (Once again my result differs from the book's 0.81; my 0.65 lags far behind. Bad luck?)
>>> featuresets = [(document_features(d), c) for (d, c) in documents]
>>> train_set, test_set = featuresets[100:], featuresets[:100]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print(nltk.classify.accuracy(classifier, test_set))
0.65
>>> classifier.show_most_informative_features(5)
Most Informative Features
contains(sans) = True neg : pos = 8.3 : 1.0
contains(mediocrity) = True neg : pos = 7.6 : 1.0
contains(dismissed) = True pos : neg = 7.1 : 1.0
contains(overwhelmed) = True pos : neg = 6.4 : 1.0
contains(bruckheimer) = True neg : pos = 6.3 : 1.0
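The numbers on the right are likelihood ratios: the estimated probability of the feature in one class divided by its probability in the other. With invented counts (NLTK also applies smoothing, which is omitted here), a ratio like 8.3 : 1.0 falls out of a simple division:

```python
# Invented counts, purely for illustration: suppose 'contains(sans) = True'
# occurs in 25 of 900 negative reviews but only 3 of 900 positive ones.
neg_count, pos_count, n_neg, n_pos = 25, 3, 900, 900
ratio = (neg_count / n_neg) / (pos_count / n_pos)
print('neg : pos = %.1f : 1.0' % ratio)  # neg : pos = 8.3 : 1.0
```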
2. POS Tagging
Our next goal is to train a classifier that guesses a word's part of speech from its suffix. First, let's find the most common suffixes.
Then we will define a feature extractor function that checks a given word for these suffixes.
Finally, we will use our feature extractor to train a new decision tree classifier.
>>> from nltk.corpus import brown
>>> suffix_fdist = nltk.FreqDist()
>>> for word in brown.words():
... word = word.lower()
... suffix_fdist[word[-1:]]+=1
... suffix_fdist[word[-2:]]+=1
... suffix_fdist[word[-3:]]+=1
...
>>> common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
>>> print(common_suffixes)
[u'e', u',', u'.', u's', u'd', u't', u'he', u'n', u'a', u'of', u'the', u'y', u'r', u'to', u'in', u'f', u'o', u'ed', u'nd', u'is', u'on', u'l', u'g', u'and', u'ng', u'er', u'as', u'ing', u'h', u'at', u'es', u'or', u're', u'it', u'``', u'an', u"''", u'm', u';', u'i', u'ly', u'ion', u'en', u'al', u'?', u'nt', u'be', u'hat', u'st', u'his', u'th', u'll', u'le', u'ce', u'by', u'ts', u'me', u've', u"'", u'se', u'ut', u'was', u'for', u'ent', u'ch', u'k', u'w', u'ld', u'`', u'rs', u'ted', u'ere', u'her', u'ne', u'ns', u'ith', u'ad', u'ry', u')', u'(', u'te', u'--', u'ay', u'ty', u'ot', u'p', u'nce', u"'s", u'ter', u'om', u'ss', u':', u'we', u'are', u'c', u'ers', u'uld', u'had', u'so', u'ey']
>>> def pos_features(word):
... features = {}
... for suffix in common_suffixes:
... features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
... return features
...
>>> tagged_words = brown.tagged_words(categories='news')
>>> featuresets = [(pos_features(n), g) for (n,g) in tagged_words]
>>> size = int(len(featuresets) * 0.1)
>>> train_set, test_set = featuresets[size:], featuresets[:size]
>>> classifier = nltk.DecisionTreeClassifier.train(train_set)
>>> nltk.classify.accuracy(classifier, test_set)
0.62705121829935351
>>> classifier.classify(pos_features('cats'))
'NNS'
Note: I copied the last three lines from the book. My DecisionTreeClassifier run is taking far too long; I don't know whether it has finished yet, so I'll go check on it.
One nice property of decision tree models is that they are usually easy to interpret. We can even instruct NLTK to print them out as pseudocode:
>>> print(classifier.pseudocode(depth=4))
if endswith(,) == True: return ','
if endswith(,) == False:
  if endswith(the) == True: return 'AT'
  if endswith(the) == False:
    if endswith(s) == True:
      if endswith(is) == True: return 'BEZ'
      if endswith(is) == False: return 'VBZ'
    if endswith(s) == False:
      if endswith(.) == True: return '.'
      if endswith(.) == False: return 'NN'
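The pseudocode above translates directly into nested conditionals. A hand-coded Python equivalent of this depth-4 tree (tags taken from the output above; the real classifier is much deeper) would look like this:

```python
def classify_by_suffix(word):
    """Depth-4 suffix decision tree, hand-coded from the pseudocode above."""
    w = word.lower()
    if w.endswith(','):
        return ','        # comma
    if w.endswith('the'):
        return 'AT'       # article
    if w.endswith('s'):
        if w.endswith('is'):
            return 'BEZ'  # the verb "is"
        return 'VBZ'      # e.g. 3rd-person singular verbs
    if w.endswith('.'):
        return '.'        # sentence-final punctuation
    return 'NN'           # default: singular noun

print(classify_by_suffix('the'))   # AT
print(classify_by_suffix('is'))    # BEZ
print(classify_by_suffix('runs'))  # VBZ
```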