老卫带你学 (Learn with Lao Wei): An Analysis of sklearn's CountVectorizer() Class

老卫带你学

Categories: Deep Learning; Machine Learning: Math Foundations

Article link: https://blog.csdn.net/yixieling4397/article/details/100540077


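transform() mainly converts new text into a feature matrix, with one caveat: the features have already been fixed. The feature sequence is the one determined by the corpus that was passed to an earlier fit_transform() call. See the example: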

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vec = CountVectorizer()
>>> vec.transform(['Something completely new.']).toarray()

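This raises an error: sklearn.exceptions.NotFittedError: CountVectorizer - Vocabulary wasn't fitted. It means there is no vocabulary yet, so the text cannot be transformed. In other words, because the vocabulary table has not been built, there are no column indices against which to count word occurrences.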
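The failure described above can be caught explicitly. The following is a minimal sketch, assuming scikit-learn is installed; the variable names `fitted` and `message` are only illustrative, and the exact wording of the error message may differ between scikit-learn versions.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()  # no fit() or fit_transform() has been called yet

try:
    vec.transform(['Something completely new.'])
    fitted = True
except NotFittedError as err:
    # No vocabulary exists, so the text cannot be mapped to columns
    fitted = False
    message = str(err)

print(fitted)
print(message)
```

Calling fit() or fit_transform() on a corpus first is what builds the vocabulary and makes transform() usable.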

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vec.fit_transform(corpus)
X.toarray()

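The vocabulary list:

>>> # In scikit-learn 1.0 and later, use vec.get_feature_names_out(); get_feature_names() was removed in 1.2
>>> vec.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']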

The resulting sparse matrix, shown as a dense array via toarray(), is:

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)

Once the vocabulary has been built, transform() can be used to convert new text into a matrix:

>>> vec.transform(['this is']).toarray()
array([[0, 0, 0, 1, 0, 0, 0, 0, 1]], dtype=int64)
>>> vec.transform(['too bad']).toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)
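To see which column each word lands in, the fitted vocabulary can be inspected directly. This is a small sketch, assuming scikit-learn is installed and using the same corpus as above; the mixed sentence 'this is too bad' and the names `col_is`, `col_this`, and `row` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vec = CountVectorizer()
vec.fit(corpus)

# vocabulary_ maps each learned token to its column index
col_is = vec.vocabulary_['is']
col_this = vec.vocabulary_['this']

# Known words are counted in their columns; unknown words are ignored
row = vec.transform(['this is too bad']).toarray()[0]

print(col_is, col_this)
print(row.tolist())
```

The row always has one entry per vocabulary word, no matter how many unseen words the new text contains.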

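A brief analysis: both words in 'this is' are in the vocabulary, so their counts are recorded in the corresponding columns to form the matrix. Neither word in 'too bad' appears in the vocabulary, so every entry in its row is 0.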
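The fit-then-transform behavior can also be sketched in plain Python. This is a simplified illustration, not scikit-learn's actual implementation: the helper names `tokenize`, `fit_vocabulary`, and `transform` are hypothetical, and the sketch assumes the default settings of lowercasing the text and keeping only tokens of two or more word characters.

```python
import re
from collections import Counter

# Assumption: mimic the default tokenization (tokens of 2+ word characters)
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_RE.findall(text.lower())


def fit_vocabulary(corpus):
    """Build a word -> column index mapping from the corpus, sorted alphabetically."""
    words = sorted({tok for doc in corpus for tok in tokenize(doc)})
    return {word: idx for idx, word in enumerate(words)}


def transform(docs, vocabulary):
    """Count vocabulary words in each document; unknown words are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        # dicts keep insertion order, so columns follow the sorted vocabulary
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows


corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vocabulary = fit_vocabulary(corpus)
matrix = transform(corpus, vocabulary)
known = transform(['this is'], vocabulary)
unknown = transform(['too bad'], vocabulary)

print(list(vocabulary))
print(known)    # [[0, 0, 0, 1, 0, 0, 0, 0, 1]]
print(unknown)  # [[0, 0, 0, 0, 0, 0, 0, 0, 0]]
```

The sketch returns dense lists for readability, whereas CountVectorizer returns a sparse matrix.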
