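import random

import nltk
from nltk.corpus import movie_reviews

# Build (word list, category) pairs for every review, then shuffle them.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# Frequency distribution over all lower-cased words in the corpus.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

# Python 3: keys() returns a view, so wrap it in list() before slicing.
# In NLTK 3 keys() is not sorted by frequency; for the 2000 most frequent words use
# [word for (word, freq) in all_words.most_common(2000)] instead.
word_features = list(all_words.keys())[:2000]

def document_features(document):
    document_words = set(document)  # set lookup is faster than scanning a list
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features  # must sit outside the for loop

print(movie_reviews.words('pos/cv957_8737.txt'))
print(document_features(movie_reviews.words('pos/cv957_8737.txt')))
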
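This example comes from page 211 of 自然语言处理-Python (Natural Language Processing with Python). The book was written for Python 2, so I changed one line: word_features=list(all_words.keys())[:2000]
The code checks whether the target document contains each of the first 2000 words taken from movie_reviews.words. For every such word it records True if the word is present and False otherwise.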
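
To make the behaviour concrete without loading the corpus, here is a minimal sketch of the same feature extractor run on toy data. The names toy_word_features and toy_document are made up for illustration, and the vocabulary is passed in as a second argument (an assumption that differs from the listing above, which reads a global).

```python
# Hypothetical stand-ins for word_features and a tokenized review.
toy_word_features = ["plot", "boring", "great", "actor"]
toy_document = ["a", "great", "plot", "and", "a", "great", "cast"]

def document_features(document, word_features):
    """Map each vocabulary word to True/False depending on whether it occurs in the document."""
    document_words = set(document)  # duplicates collapse; membership tests are fast
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

features = document_features(toy_document, toy_word_features)
print(features)
# {'contains(plot)': True, 'contains(boring)': False, 'contains(great)': True, 'contains(actor)': False}
```

The result always has one entry per vocabulary word, regardless of how long the document is.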
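I ran into two main problems:
1. TypeError: 'dict_keys' object is not subscriptable
In Python 3, keys() returns a view object that cannot be sliced. Changing the line to word_features=list(all_words.keys())[:2000] makes the error go away. Note that in NLTK 3 the keys are no longer sorted by frequency, so all_words.most_common(2000) is the way to get the 2000 most frequent words.
2. The return features line was indented incorrectly, so the function exited at that point and returned the result for only one feature. Fixing the indentation solved the problem.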
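
Both problems can be reproduced on toy data, without the corpus. The sketch below uses collections.Counter as a stand-in for nltk.FreqDist (which subclasses Counter in NLTK 3). The function names and sample words are invented for illustration, and the buggy variant assumes the return had ended up inside the for loop, which matches the symptom of getting only one feature back.

```python
from collections import Counter

# Toy data standing in for the FreqDist built from movie_reviews.
toy_counts = Counter(["plot", "the", "film", "the", "film", "the"])

# Problem 1: a dict_keys view cannot be sliced in Python 3.
try:
    toy_counts.keys()[:2]
    slicing_failed = False
except TypeError:
    slicing_failed = True

first_two = list(toy_counts.keys())[:2]                   # insertion order (Python 3.7+)
top_two = [word for word, _ in toy_counts.most_common(2)]  # frequency order

# Problem 2: where the return statement sits.
def features_return_inside_loop(document, word_features):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
        return features  # bug: exits during the first iteration

def features_return_after_loop(document, word_features):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features  # runs once, after every word has been checked

vocab = ["plot", "the", "film"]
doc = ["the", "film", "was", "long"]
buggy = features_return_inside_loop(doc, vocab)
fixed = features_return_after_loop(doc, vocab)

print(slicing_failed)         # True
print(first_two, top_two)     # ['plot', 'the'] ['the', 'film']
print(len(buggy), len(fixed)) # 1 3
```

Here first_two and top_two differ, which shows why slicing the key list is not the same as taking the most frequent words.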