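scikit-learn: 0.3. Extracting features from text files (tf, tf-idf) and training a classifier

The previous post covered how to load the data.

This post follows: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

It covers the following parts: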

Extracting features from text files

Training a classifier




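Before a model can be run, the contents of the text files have to be converted into numerical feature vectors. The two common representations are tf (term frequency) and tf-idf (term frequency weighted by inverse document frequency).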


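1. tf:

The first problem to deal with is high-dimensional sparse datasets: most entries of a document-term count matrix are zero. scipy.sparse matrices are designed for exactly this case, and scikit-learn has built-in support for these structures.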

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(rawData.data)

X_train_counts
Out[43]: 
<6x11 sparse matrix of type '<type 'numpy.int64'>'
	with 18 stored elements in Compressed Sparse Row format>

X_train_counts.shape
Out[44]: (6, 11)

print(count_vect.vocabulary_.get(u'like'))
print(count_vect.vocabulary_.get(u'good'))
3
1

print(X_train_counts)
  (0, 8)        1
  (0, 0)        1
  (0, 3)        1
  (1, 8)        1
  (1, 3)        1
  (1, 10)       1
  (1, 9)        1
  (2, 8)        1
  (2, 4)        1
  (3, 8)        1
  (3, 6)        1
  (3, 1)        1
  (4, 8)        1
  (4, 2)        1
  (5, 8)        1
  (5, 1)        1
  (5, 5)        1
  (5, 7)        1
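To make the count matrix easier to read, here is a minimal sketch on a tiny made-up corpus. The three sentences are hypothetical and are not the rawData used in this post. It assumes the default CountVectorizer settings, which lowercase the text and keep only tokens of two or more word characters, so the single-letter "I" is dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration.
corpus = [
    "I like this movie",
    "this movie is good",
    "good good movie",
]

vect = CountVectorizer()
counts = vect.fit_transform(corpus)  # scipy.sparse matrix, one row per document

# vocabulary_ maps each term to its column index (terms are sorted alphabetically).
terms_by_column = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(terms_by_column)

# Converting to a dense array is fine for a toy corpus; avoid it on real data.
dense = counts.toarray().tolist()
for doc, row in zip(corpus, dense):
    print(row, doc)
```

Each row is one document and each column is one vocabulary term, so an entry such as (2, 0) holds how many times term 0 occurs in document 2. The sparse printout above lists only the nonzero entries of the same kind of matrix.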




2. tf-idf:

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
Out[53]: (6, 11)

X_train_tfidf
Out[54]: 
<6x11 sparse matrix of type '<type 'numpy.float64'>'
	with 18 stored elements in Compressed Sparse Row format>

print X_train_tfidf
  (0, 3)        0.599738830611
  (0, 0)        0.731376058697
  (0, 8)        0.324657351406
  (1, 9)        0.590335838052
  (1, 10)       0.590335838052
  (1, 3)        0.484083832074
  (1, 8)        0.262049690228
  (2, 4)        0.913996360826
  (2, 8)        0.405722383406
  (3, 1)        0.599738830611
  (3, 6)        0.731376058697
  (3, 8)        0.324657351406
  (4, 2)        0.913996360826
  (4, 8)        0.405722383406
  (5, 7)        0.590335838052
  (5, 5)        0.590335838052
  (5, 1)        0.484083832074
  (5, 8)        0.262049690228
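The weights above can be reproduced by hand, which helps to see what the transformer is doing. The sketch below rebuilds the count matrix from the 18 nonzero positions printed in section 1 (every count there is 1) and assumes the default TfidfTransformer settings: a smoothed idf of ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row. The variable names are my own and only standard-library Python is used.

```python
import math

# (row, col) positions of the nonzero counts printed in section 1; each count is 1.
nonzero = [
    (0, 8), (0, 0), (0, 3),
    (1, 8), (1, 3), (1, 10), (1, 9),
    (2, 8), (2, 4),
    (3, 8), (3, 6), (3, 1),
    (4, 8), (4, 2),
    (5, 8), (5, 1), (5, 5), (5, 7),
]
n_docs, n_terms = 6, 11

# Document frequency: in how many documents each term appears.
df = [0] * n_terms
for _, col in nonzero:
    df[col] += 1

# Smoothed inverse document frequency (assumed default: smooth_idf=True).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

# tf * idf for each document, then scale each row to unit L2 norm.
tfidf = {}
for row in range(n_docs):
    weights = {col: 1 * idf[col] for r, col in nonzero if r == row}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    for col, w in weights.items():
        tfidf[(row, col)] = w / norm

for key in [(0, 3), (0, 0), (0, 8)]:
    print(key, tfidf[key])
```

Term 8 appears in all six documents, so its idf is the minimum value of 1 and it receives the smallest weight in every row. Terms that appear in only one document, such as term 0, get the largest weights.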



3. Training a classifier:

Using naive Bayes as an example:

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, rawData.target)


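4. Prediction:

When new documents arrive, they have to go through exactly the same feature-extraction steps. The difference is that we call transform instead of fit_transform on the transformers, because they have already been fit on the training set: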

docs_new = ['i like this', 'haha, start.']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, rawData.target_names[category]))
'i like this' => category_2_folder
'haha, start.' => category_1_folder
On these two simple examples, at least, the predictions look reasonable.
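The point about calling transform rather than fit_transform is easy to check on a toy example. The sketch below uses a made-up four-document training set with hypothetical labels (0 = negative, 1 = positive), not the rawData from this post. It compares the feature width produced by the already-fitted vectorizer with the width produced by a freshly fitted one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set; labels: 0 = negative, 1 = positive.
train_docs = [
    "i like this film",
    "good film, i like it",
    "bad film",
    "i hate this bad film",
]
train_labels = [1, 1, 0, 0]

count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
X_train = tfidf_transformer.fit_transform(count_vect.fit_transform(train_docs))
clf = MultinomialNB().fit(X_train, train_labels)

new_docs = ["i like it", "bad bad film"]

# Correct: reuse the fitted objects, so columns mean the same thing as in training.
X_new = tfidf_transformer.transform(count_vect.transform(new_docs))
predicted = clf.predict(X_new)
print(X_train.shape[1], X_new.shape[1], list(predicted))

# Incorrect: refitting on the new documents learns a different vocabulary,
# so the column count (and the meaning of each column) no longer matches clf.
X_refit = CountVectorizer().fit_transform(new_docs)
print(X_refit.shape[1])
```

The refitted matrix has a different number of columns from the training matrix, so passing it to clf.predict would fail or, worse, silently map terms to the wrong columns when the widths happen to coincide.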







