sklearn: Naive Bayes Text Classification (4)

After stripping 'headers', 'footers', and 'quotes' from the data, accuracy actually went down.

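from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20
from sklearn.naive_bayes import MultinomialNB

# Load all posts with headers, footers, and quoted replies stripped
news = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
# Hold out 25% of the posts for testing
X_train, X_test, Y_train, Y_test = train_test_split(news.data, news.target, test_size=0.25)

# Fit TF-IDF on the training text only, then reuse that vocabulary for the test text
tfidf = TfidfVectorizer()
X_tfidf_train = tfidf.fit_transform(X_train)
X_tfidf_test = tfidf.transform(X_test)

# Train multinomial naive Bayes and print the test accuracy
mnb_tfidf = MultinomialNB()
mnb_tfidf.fit(X_tfidf_train, Y_train)
print(mnb_tfidf.score(X_tfidf_test, Y_test))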

With 'headers', 'footers', and 'quotes' removed, a post in the dataset looks like this:

A "moment of silence" doesn't mean much unless *everyone*
participates.  Otherwise it's not silent, now is it?

Non-religious reasons for having a "moment of silence" for a dead
classmate: (1) to comfort the friends by showing respect to the
deceased , (2) to give the classmates a moment to grieve together, (3)
to give the friends a moment to remember their classmate *in the
context of the school*, (4) to deal with the fact that the classmate
is gone so that it's not disruptive later.

Blindly opposing everything with a flavor of religion in it is
utterly idiotic.
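
To give a feel for what the `remove` option throws away, here is a rough sketch that strips a header block and quoted lines from a made-up message. This is only an approximation written for illustration, not scikit-learn's own implementation; the helper names `strip_header` and `strip_quotes` and the sample message are hypothetical.

```python
def strip_header(text):
    """Drop everything before the first blank line (the header block)."""
    _header, _sep, body = text.partition("\n\n")
    return body


def strip_quotes(text):
    """Drop lines that quote an earlier post (lines starting with '>')."""
    kept = [line for line in text.split("\n") if not line.lstrip().startswith(">")]
    return "\n".join(kept)


# Made-up message for illustration; it is not taken from the dataset.
raw = (
    "From: someone@example.com\n"
    "Subject: Re: moment of silence\n"
    "Organization: Example University\n"
    "\n"
    "> Why hold one at all?\n"
    "A moment of silence lets classmates grieve together.\n"
)

# Only the author's own sentence survives the stripping.
print(strip_quotes(strip_header(raw)))
```

Fields such as the subject line often name the topic directly, so once they are gone the classifier only has the body text to work with.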
Results:

(1 = yes, 0 = no)

Use TF-IDF   Remove stop words   Remove headers/footers   Accuracy
1            0                   1                        0.6
1            1                   1                        0.68
1            0                   0                        0.85
1            1                   0                        0.87

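This shows that removing 'headers', 'footers', and 'quotes' makes the results worse (accuracy falls from about 0.85 to 0.6 without stop-word removal, and from 0.87 to 0.68 with it), so it is better to keep them.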
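
The four rows of the comparison could be reproduced with a loop along the lines of the sketch below. The helper names `evaluate` and `compare_on_20newsgroups`, the fixed `seed`, and the use of `stop_words="english"` for the stop-word setting are my assumptions, not something the original runs state, so the exact numbers will differ from the ones reported above.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def evaluate(texts, labels, stop_words=None, seed=0):
    """Train TF-IDF + MultinomialNB on a 75/25 split and return test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, random_state=seed
    )
    # The pipeline fits the vectorizer on the training text only.
    model = make_pipeline(TfidfVectorizer(stop_words=stop_words), MultinomialNB())
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


def compare_on_20newsgroups():
    """Print accuracy for each (remove, stop_words) combination.

    Downloads the dataset the first time it is called.
    """
    stripped = ("headers", "footers", "quotes")
    for remove in ((), stripped):
        news = fetch_20newsgroups(subset="all", remove=remove)
        for stop_words in (None, "english"):  # assumed stop-word setting
            acc = evaluate(news.data, news.target, stop_words=stop_words)
            print(f"remove={bool(remove)} stop_words={stop_words} accuracy={acc:.2f}")
```

Calling `compare_on_20newsgroups()` prints one line per configuration. Fixing the split seed keeps the four runs comparable, since each one is scored on the same held-out posts.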
