After stripping 'headers', 'footers', and 'quotes' from the data, accuracy actually dropped.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Load all posts with headers, footers, and quoted replies stripped
news = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
# Hold out 25% of the posts for testing
X_train, X_test, Y_train, Y_test = train_test_split(news.data, news.target, test_size=0.25)
# Fit TF-IDF on the training split only, then reuse its vocabulary for the test split
tfidf = TfidfVectorizer()
X_tfidf_train = tfidf.fit_transform(X_train)
X_tfidf_test = tfidf.transform(X_test)
# Train multinomial naive Bayes and print the test accuracy
mnb_tfidf = MultinomialNB()
mnb_tfidf.fit(X_tfidf_train, Y_train)
print(mnb_tfidf.score(X_tfidf_test, Y_test))
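To make concrete what the `remove` argument takes out of each post, here is a rough pure-Python sketch. It is only an approximation under stated assumptions: the function names, the regular expression, and the "-- " signature rule are illustrative and are not scikit-learn's actual filtering code, which differs in detail.

```python
import re

# Assumed pattern for quoted material: lines starting with "In article", ">" or "|",
# or attribution lines ending in "writes:" / "wrote:". Illustrative only.
QUOTE_RE = re.compile(r"^(In article|>|\|)|(writes:|wrote:)$")


def strip_header(text):
    """Drop everything up to and including the first blank line."""
    _header, _sep, body = text.partition("\n\n")
    return body


def strip_quotes(text):
    """Drop lines that look like quoted replies or attribution lines."""
    kept = [line for line in text.split("\n") if not QUOTE_RE.search(line)]
    return "\n".join(kept)


def strip_footer(text):
    """Drop a trailing signature introduced by a '--' line, if there is one."""
    lines = text.rstrip().split("\n")
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].strip() == "--":
            return "\n".join(lines[:i]).rstrip()
    return "\n".join(lines)


# A made-up post, not taken from the dataset
post = (
    "From: alice@example.com\n"
    "Subject: Re: moment of silence\n"
    "\n"
    "In article <123@example.com> bob@example.com writes:\n"
    "> It is only a minute.\n"
    "\n"
    "It only works if everyone participates.\n"
    "-- \n"
    "Alice"
)
cleaned = strip_footer(strip_quotes(strip_header(post))).strip()
print(cleaned)
```

Only the body sentence survives, so words that appear solely in the `From:` and `Subject:` lines or in the signature never reach the vectorizer.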
After removing 'headers', 'footers', and 'quotes', a post in the dataset looks like this:
A "moment of silence" doesn't mean much unless *everyone* participates. Otherwise it's not silent, now is it? Non-religious reasons for having a "moment of silence" for a dead classmate: (1) to comfort the friends by showing respect to the deceased , (2) to give the classmates a moment to grieve together, (3) to give the friends a moment to remember their classmate *in the context of the school*, (4) to deal with the fact that the classmate is gone so that it's not disruptive later. Blindly opposing everything with a flavor of religion in it is utterly idiotic.
Results:
Uses TF-IDF | Stop words removed | Headers/footers/quotes removed | Accuracy |
1 | 0 | 1 | 0.6 |
1 | 1 | 1 | 0.68 |
1 | 0 | 0 | 0.85 |
1 | 1 | 0 | 0.87 |
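This shows that removing 'headers', 'footers', and 'quotes' hurts accuracy (0.85 drops to 0.6 without stop-word removal, and 0.87 drops to 0.68 with it), so it is better to keep them.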
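The four rows could be reproduced with a loop along the lines of the sketch below. The helper names `evaluate` and `run_comparison`, the fixed `random_state`, and the use of scikit-learn's built-in `'english'` stop-word list are assumptions; the original does not say which stop-word list was used, so the exact numbers may differ.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def evaluate(texts, labels, stop_words=None, seed=0):
    """Return test accuracy of TF-IDF + MultinomialNB on a 75/25 split."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, random_state=seed
    )
    tfidf = TfidfVectorizer(stop_words=stop_words)
    model = MultinomialNB()
    model.fit(tfidf.fit_transform(X_train), y_train)
    return model.score(tfidf.transform(X_test), y_test)


def run_comparison():
    """Print accuracy for all four combinations (downloads the dataset)."""
    for remove in ((), ("headers", "footers", "quotes")):
        news = fetch_20newsgroups(subset="all", remove=remove)
        for stop_words in (None, "english"):
            acc = evaluate(news.data, news.target, stop_words=stop_words)
            print(f"remove={remove!r} stop_words={stop_words!r} accuracy={acc:.2f}")
```

Calling `run_comparison()` fetches the 20 Newsgroups data on first use. Fixing the split seed keeps the four runs comparable, since each one is then evaluated on the same held-out posts.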