Original title: Text classification with sklearn (feature extraction, KNN/SVM, clustering)
Data Mining: Introduction and Practice (WeChat public account: datadw)
The workflow consists of the following steps:
Load the dataset
Extract features
Classification:
Naive Bayes
KNN
SVM
Clustering
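Before walking through each step, here is a minimal end-to-end sketch of the pipeline on a tiny in-memory toy corpus. The documents, labels, and the choice of TfidfVectorizer are illustrative assumptions for demonstration only; the article itself works with the real 20 Newsgroups data below.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in corpus: two made-up categories (0 = graphics, 1 = windows)
train_docs = ["opengl renders 3d graphics",
              "shader pipeline on the graphics card",
              "windows registry driver settings",
              "windows kernel driver update"]
train_labels = [0, 0, 1, 1]

# Step 2: extract features (tf-idf weighted term counts)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Step 3: classify, here with Naive Bayes
clf = MultinomialNB()
clf.fit(X_train, train_labels)

# Transform unseen text with the SAME fitted vectorizer, then predict
X_new = vectorizer.transform(["graphics card shader"])
print(clf.predict(X_new))  # predicts category 0 (graphics)
```

Note that the test document is passed through `transform`, not `fit_transform`: the vocabulary learned from the training set must be reused unchanged at prediction time.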
Three versions of the dataset are available at http://qwone.com/~jason/20Newsgroups/ ; here we use the original one.
1. Loading the dataset
Download the dataset from the link above, extract it into the scikit_learn_data folder, and load it; see the code comments for details.
[python]
#first extract the 20 news_group dataset to /scikit_learn_data
from sklearn.datasets import fetch_20newsgroups
#all categories
#newsgroup_train = fetch_20newsgroups(subset='train')
#part categories
categories = ['comp.graphics',
              'comp.os.ms-windows.misc',
              'comp.sys.ibm.pc.hardware',
              'comp.sys.mac.hardware',
              'comp.windows.x']
newsgroup_train = fetch_20newsgroups(subset='train', categories=categories)
You can check that the data loaded correctly:
[python]
#print category names
from pprint import pprint
pprint(list(newsgroup_train.target_names))
Result:
['comp.graphics',
'comp.os.ms-windows.misc',
'comp.sys.ibm.pc.hardware',
'comp.sys.mac.hardware',
'comp.windows.x']
2. Extracting features
The newsgroup_train we just loaded is a collection of documents. We need to extract feature vectors from it, such as term frequencies, using fit_transform.
Method 1: HashingVectorizer, with a fixed number of features
[python]
#newsgroup_train.data is the original documents, but we need to extract the
#feature vectors in order to model the text data
from sklearn.feature_extraction.text import HashingVectorizer
#note: non_negative was removed in newer scikit-learn versions;
#there, use alternate_sign=False instead
vectorizer = HashingVectorizer(stop_words='english', non_negative=True,
                               n_features=10000)
fea_train = vectorizer.fit_transform(newsgroup_train.data)
#load the test split the same way, then hash it with the same vectorizer
newsgroup_test = fetch_20newsgroups(subset='test', categories=categories)
fea_test = vectorizer.fit_transform(newsgroup_test.data)
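Calling fit_transform separately on the train and test sets is safe here only because HashingVectorizer is stateless: the hash function is fixed, so fitting learns nothing and the same text always maps to the same vector (this would NOT be safe with CountVectorizer or TfidfVectorizer). A minimal sketch of this property on a toy document, using alternate_sign=False as the modern replacement for non_negative=True (an assumption about your scikit-learn version):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# alternate_sign=False keeps all hashed counts non-negative
# (replacement for the removed non_negative=True argument)
vectorizer = HashingVectorizer(stop_words='english', alternate_sign=False,
                               n_features=10000)

doc = ["the graphics card renders polygons"]
v1 = vectorizer.transform(doc)  # hashing is stateless: no fit needed
v2 = HashingVectorizer(stop_words='english', alternate_sign=False,
                       n_features=10000).transform(doc)

print(v1.shape)        # (1, 10000)
print((v1 != v2).nnz)  # 0 -> two independent vectorizers agree exactly
```

The trade-off is that hashing is one-way: unlike CountVectorizer, you cannot map a column index back to the term it came from.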