scikit-learn: 0.3. Extracting features from text files (tf, tf-idf) and training a classifier

mmc2015

The previous post covered how to load the data.

This post follows: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

It mainly covers the following parts:

Extracting features from text files

Training a classifier

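Before a model can be run, the content of the text files has to be converted into numeric feature vectors. The common representations are tf (term frequency) and tf-idf (term frequency times inverse document frequency).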
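To make "numeric feature vectors" concrete, here is a minimal bag-of-words sketch in plain Python. The two documents are a hypothetical toy corpus, not the data used in this post, and the tokenizing rule (lowercase, keep tokens of two or more word characters) is meant to mirror the CountVectorizer defaults rather than reproduce its full behaviour.

```python
import re
from collections import Counter

# Hypothetical toy corpus, not the data used in this post.
docs = ["I like this movie", "this movie is good"]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    # Single-character tokens such as "i" are dropped by this rule.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: every distinct token, sorted, mapped to a column index.
all_tokens = sorted({t for d in docs for t in tokenize(d)})
vocabulary = {term: idx for idx, term in enumerate(all_tokens)}

# One row of raw term counts (tf) per document, one column per term.
rows = []
for d in docs:
    counts = Counter(tokenize(d))
    rows.append([counts.get(term, 0) for term in all_tokens])

print(vocabulary)
print(rows)
```

Each document becomes one row and each vocabulary term one column, which is the layout the count matrix in the next section uses.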

1. tf:

The first issue to deal with is that these are high-dimensional sparse datasets. scipy.sparse matrices are designed for exactly this case, and scikit-learn has built-in support for these structures.

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(rawData.data)

X_train_counts
Out[43]: 
<6x11 sparse matrix of type '<type 'numpy.int64'>'
	with 18 stored elements in Compressed Sparse Row format>

X_train_counts.shape
Out[44]: (6, 11)

print(count_vect.vocabulary_.get(u'like'))
print(count_vect.vocabulary_.get(u'good'))
3
1

print(X_train_counts)
  (0, 8)        1
  (0, 0)        1
  (0, 3)        1
  (1, 8)        1
  (1, 3)        1
  (1, 10)       1
  (1, 9)        1
  (2, 8)        1
  (2, 4)        1
  (3, 8)        1
  (3, 6)        1
  (3, 1)        1
  (4, 8)        1
  (4, 2)        1
  (5, 8)        1
  (5, 1)        1
  (5, 5)        1
  (5, 7)        1
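Each printed line reads as (document row, vocabulary column) followed by the count stored there. As a rough sketch of how such a matrix is held in scipy.sparse, the snippet below rebuilds the same 6x11 matrix from the printed pairs. The name `pairs` is introduced here purely for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

# (row, column) pairs copied from the printed output; every count is 1.
pairs = [(0, 8), (0, 0), (0, 3),
         (1, 8), (1, 3), (1, 10), (1, 9),
         (2, 8), (2, 4),
         (3, 8), (3, 6), (3, 1),
         (4, 8), (4, 2),
         (5, 8), (5, 1), (5, 5), (5, 7)]
rows, cols = zip(*pairs)
data = np.ones(len(pairs), dtype=np.int64)

# Only the 18 non-zero entries are stored, not all 6 * 11 cells.
X = csr_matrix((data, (rows, cols)), shape=(6, 11))

print(X.shape, X.nnz)           # dimensions and number of stored elements
print(X.toarray())              # dense view, only sensible for tiny matrices
print(X.toarray().sum(axis=0))  # per-column totals; here also the document frequency
```

Column 8 is non-zero in all six rows, so that term occurs in every document, while most other columns occur in a single document.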

2. tf-idf:

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape
Out[53]: (6, 11)

X_train_tfidf
Out[54]: 
<6x11 sparse matrix of type '<type 'numpy.float64'>'
	with 18 stored elements in Compressed Sparse Row format>

print X_train_tfidf
  (0, 3)        0.599738830611
  (0, 0)        0.731376058697
  (0, 8)        0.324657351406
  (1, 9)        0.590335838052
  (1, 10)       0.590335838052
  (1, 3)        0.484083832074
  (1, 8)        0.262049690228
  (2, 4)        0.913996360826
  (2, 8)        0.405722383406
  (3, 1)        0.599738830611
  (3, 6)        0.731376058697
  (3, 8)        0.324657351406
  (4, 2)        0.913996360826
  (4, 8)        0.405722383406
  (5, 7)        0.590335838052
  (5, 5)        0.590335838052
  (5, 1)        0.484083832074
  (5, 8)        0.262049690228
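The values above can be reproduced by hand. With its default settings (smooth_idf=True, norm='l2'), TfidfTransformer weights each count by idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing term t, and then scales every row to unit Euclidean length. The following is a sketch in plain Python 3 that uses the counts from the tf output. The variable names are illustrative and this is not the library's own implementation.

```python
import math

# Term counts per document as {column: count}, copied from the tf output.
counts = [
    {8: 1, 0: 1, 3: 1},
    {8: 1, 3: 1, 10: 1, 9: 1},
    {8: 1, 4: 1},
    {8: 1, 6: 1, 1: 1},
    {8: 1, 2: 1},
    {8: 1, 1: 1, 5: 1, 7: 1},
]
n_docs = len(counts)

# Document frequency: in how many documents each column occurs.
df = {}
for doc in counts:
    for col in doc:
        df[col] = df.get(col, 0) + 1

# Smoothed idf: ln((1 + n) / (1 + df)) + 1.
idf = {col: math.log((1 + n_docs) / (1 + d)) + 1 for col, d in df.items()}

def tfidf_row(doc):
    # Weight each count by its idf, then scale the row to unit L2 norm.
    weighted = {col: tf * idf[col] for col, tf in doc.items()}
    norm = math.sqrt(sum(w * w for w in weighted.values()))
    return {col: w / norm for col, w in weighted.items()}

tfidf = [tfidf_row(doc) for doc in counts]
for row in tfidf:
    print(row)
```

Column 8 occurs in every document, so its idf is exactly 1 and it receives the smallest weight in each row. Columns that occur in only one document receive the largest.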

3. Training a classifier:

Using naive Bayes as an example:

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, rawData.target)

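4. Prediction:

When new documents arrive, they have to go through exactly the same feature extraction steps. The one difference is that we call transform instead of fit_transform on the transformers, because they have already been fit on the training set: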

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, rawData.target)
docs_new = ['i like this', 'haha, start.']
# Reuse the fitted vectorizer and transformer: transform only, no refitting.
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, rawData.target_names[category]))
'i like this' => category_2_folder
'haha, start.' => category_1_folder
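Why transform rather than fit_transform matters can be seen from the matrix shapes alone. The sketch below uses a hypothetical three-document training set (not the data from this post). It compares reusing the fitted CountVectorizer with fitting a fresh one on the new documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and new documents, for illustration only.
train_docs = ["i like this movie", "this movie is good", "this is bad"]
new_docs = ["i like this", "haha, start."]

count_vect = CountVectorizer()
X_train = count_vect.fit_transform(train_docs)   # learns the vocabulary

# Correct: reuse the fitted vocabulary, so columns line up with training.
X_new = count_vect.transform(new_docs)

# Wrong for prediction: a fresh fit builds a different vocabulary.
X_wrong = CountVectorizer().fit_transform(new_docs)

print(X_train.shape, X_new.shape, X_wrong.shape)
print(X_new.toarray())
```

With transform, the new matrix has the same number of columns as the training matrix, and words never seen during training are simply ignored. A fresh fit yields a different column count and column meaning, so a classifier trained on the original features could not use it.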

So even this simple prediction looks fairly accurate, at least on this small six-document example.
