选择电影评论语料库,每个评论归类为正面和负面['neg', 'pos']。
from nltk.corpus import movie_reviews
import nltk
import random
documents = [(list(movie_reviews.words(fileid)), category) for category in movie_reviews.categories() for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
文档已经按类别标记完成。
movie_reviews.categories()
Out[13]: ['neg', 'pos']
接下来,为文档定义特征提取器,这样分类器就会知道应注意哪些方面的数据,对于文档主题识别,可以为每个词定义一个特性以表示该文档是否包含这个词。为了限制分类器需要处理的特征数目,构建整个语料库中出现次数前2000词的链表。然后定义一个特征提取器,简单地检查这些词是否在一个给定的文档中。
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]
def document_features(document):
document_words = set(document)
features = {}
for word in word_features:
features['contains({})'.format(word)] = (word in document_words)
return features
训练和测试分类器以进行文档分类
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
output:
nltk.classify.accuracy(classifier,test_set)
Out[16]: 0.81
classifier.show_most_informative_features(5)
Most Informative Features
contains(unimaginative) = True neg : pos = 8.5 : 1.0
contains(shoddy) = True neg : pos = 7.1 : 1.0
contains(mena) = True neg : pos = 7.1 : 1.0
contains(suvari) = True neg : pos = 7.1 : 1.0
contains(atrocious) = True neg : pos = 6.7 : 1.0