Differences between CountVectorizer and TfidfVectorizer in sklearn

1.CountVectorizer

First, let's look at part of the relevant CountVectorizer source code.

class CountVectorizer(_VectorizerMixin, BaseEstimator):
    """Convert a collection of text documents to a matrix of token counts

    This implementation produces a sparse representation of the counts using
    scipy.sparse.csr_matrix.

    If you do not provide an a-priori dictionary and you do not use an analyzer
    that does some kind of feature selection then the number of features will
    be equal to the vocabulary size found by analyzing the data.

    Read more in the :ref:`User Guide <text_feature_extraction>`.

The first two lines of the docstring already state the two key points about CountVectorizer:
Convert a collection of text documents to a matrix of token counts
CountVectorizer turns a collection of documents into a matrix of token counts.
This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.
The resulting count matrix is stored as a csr_matrix, a sparse-matrix representation.

Let's test this with a simple demo:

from sklearn.feature_extraction.text import CountVectorizer

def t1():
    cv = CountVectorizer()
    train = ["Chinese Beijing Chinese",
             "Chinese Chinese Shanghai",
             "Chinese Macao",
             "Tokyo Japan Chinese"]
    cv_fit = cv.fit_transform(train)
    # note: on newer sklearn versions, get_feature_names() is
    # replaced by get_feature_names_out()
    print(cv.get_feature_names())
    print(cv_fit)            # sparse (csr) representation
    print(cv_fit.toarray())  # dense count matrix


t1()

The output:

['beijing', 'chinese', 'japan', 'macao', 'shanghai', 'tokyo']
  (0, 1)	2
  (0, 0)	1
  (1, 1)	2
  (1, 4)	1
  (2, 1)	1
  (2, 3)	1
  (3, 1)	1
  (3, 5)	1
  (3, 2)	1
[[1 2 0 0 0 0]
 [0 2 0 0 1 0]
 [0 1 0 1 0 0]
 [0 1 1 0 0 1]]

There are 6 distinct words across all the documents, so get_feature_names returns a list of 6 feature names.
cv_fit is clearly stored in csr_matrix form: (0, 1) means row 0 (the first document) and column 1 (the feature chinese), and the trailing 2 means chinese appears twice in that document.
Calling toarray converts the matrix from its sparse representation into an ordinary dense matrix; since the documents contain 6 distinct words in total, each document row is 6-dimensional.
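
One more detail worth a quick sketch (this goes beyond the original demo): the vocabulary is fixed at fit time, so transform encodes new documents against that vocabulary and silently drops out-of-vocabulary words:

from sklearn.feature_extraction.text import CountVectorizer

def t1_transform():
    cv = CountVectorizer()
    cv.fit(["Chinese Beijing Chinese",
            "Chinese Chinese Shanghai",
            "Chinese Macao",
            "Tokyo Japan Chinese"])
    # "London" is not in the fitted vocabulary, so it is dropped
    print(cv.transform(["Chinese Chinese London"]).toarray())
    # [[0 2 0 0 0 0]]

t1_transform()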

2.TfidfVectorizer

class TfidfVectorizer(CountVectorizer):
    """Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to :class:`CountVectorizer` followed by
    :class:`TfidfTransformer`.

    Read more in the :ref:`User Guide <text_feature_extraction>`.

The difference between TfidfVectorizer and CountVectorizer:
CountVectorizer returns raw term counts, while TfidfVectorizer returns tf-idf values.

from sklearn.feature_extraction.text import TfidfVectorizer

def t2():
    # norm=None keeps the raw tf * idf values; the default norm='l2'
    # would L2-normalize each row
    tf = TfidfVectorizer(use_idf=True, smooth_idf=True, norm=None)
    train = ["Chinese Beijing Chinese",
             "Chinese Chinese Shanghai",
             "Chinese Macao",
             "Tokyo Japan Chinese"]
    tf_fit = tf.fit_transform(train)
    print(tf.get_feature_names())
    print(tf_fit)
    print(tf_fit.toarray())


t2()
['beijing', 'chinese', 'japan', 'macao', 'shanghai', 'tokyo']
  (0, 0)	1.916290731874155
  (0, 1)	2.0
  (1, 4)	1.916290731874155
  (1, 1)	2.0
  (2, 3)	1.916290731874155
  (2, 1)	1.0
  (3, 2)	1.916290731874155
  (3, 5)	1.916290731874155
  (3, 1)	1.0
[[1.91629073 2.         0.         0.         0.         0.        ]
 [0.         2.         0.         0.         1.91629073 0.        ]
 [0.         1.         0.         1.91629073 0.         0.        ]
 [0.         1.         1.91629073 0.         0.         1.91629073]]
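
The docstring above says TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer. A minimal sketch to check that claim with the same parameters (t2_equivalence is a name made up for this example):

from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfTransformer,
                                             TfidfVectorizer)
import numpy as np

def t2_equivalence():
    train = ["Chinese Beijing Chinese",
             "Chinese Chinese Shanghai",
             "Chinese Macao",
             "Tokyo Japan Chinese"]
    # one step: TfidfVectorizer
    a = TfidfVectorizer(use_idf=True, smooth_idf=True,
                        norm=None).fit_transform(train)
    # two steps: raw counts, then tf-idf weighting
    counts = CountVectorizer().fit_transform(train)
    b = TfidfTransformer(use_idf=True, smooth_idf=True,
                         norm=None).fit_transform(counts)
    print(np.allclose(a.toarray(), b.toarray()))  # True

t2_equivalence()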

3.How sklearn computes idf

The core call in TfidfVectorizer that computes the tf-idf values is the following:

self._tfidf = TfidfTransformer(norm=norm, use_idf=use_idf,
                                       smooth_idf=smooth_idf,
                                       sublinear_tf=sublinear_tf)

Stepping into TfidfTransformer, the source shows the actual computation logic:

    def __init__(self, norm='l2', use_idf=True, smooth_idf=True,
                 sublinear_tf=False):
        self.norm = norm
        self.use_idf = use_idf
        self.smooth_idf = smooth_idf
        self.sublinear_tf = sublinear_tf

    def fit(self, X, y=None):
        """Learn the idf vector (global term weights)

        Parameters
        ----------
        X : sparse matrix, [n_samples, n_features]
            a matrix of term/token counts
        """
        X = check_array(X, accept_sparse=('csr', 'csc'))
        if not sp.issparse(X):
            X = sp.csr_matrix(X)
        dtype = X.dtype if X.dtype in FLOAT_DTYPES else np.float64

        if self.use_idf:
            n_samples, n_features = X.shape
            df = _document_frequency(X)
            df = df.astype(dtype, **_astype_copy_false(df))

            # perform idf smoothing if required
            df += int(self.smooth_idf)
            n_samples += int(self.smooth_idf)

            # log+1 instead of log makes sure terms with zero idf don't get
            # suppressed entirely.
            idf = np.log(n_samples / df) + 1
            self._idf_diag = sp.diags(idf, offsets=0,
                                      shape=(n_features, n_features),
                                      format='csr',
                                      dtype=dtype)

        return self

From the code above, the idf is computed as follows.
When smooth_idf is True:

$$idf = \log\frac{1 + n_d}{1 + df} + 1$$

where $n_d$ is the total number of documents and $df$ is the number of documents that contain the given term.
When smooth_idf is False:

$$idf = \log\frac{n_d}{df} + 1$$
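
We can sanity-check this formula against the demo output in section 2. For beijing, n_d = 4 and df = 1, so with smoothing idf = log(5/2) + 1 ≈ 1.9163, matching the 1.916290731874155 printed above; for chinese, df = 4 gives idf = log(5/5) + 1 = 1. A small sketch (check_idf is a made-up name; the df values are read off the demo corpus):

import numpy as np

def check_idf():
    n_d = 4                            # total documents in the demo corpus
    # document frequency per feature, in the order
    # ['beijing', 'chinese', 'japan', 'macao', 'shanghai', 'tokyo']
    df = np.array([1, 4, 1, 1, 1, 1])
    # smooth_idf=True: idf = log((1 + n_d) / (1 + df)) + 1
    print(np.log((1 + n_d) / (1 + df)) + 1)
    # [1.91629073 1.         1.91629073 1.91629073 1.91629073 1.91629073]

check_idf()

These values also match the idf_ attribute of the TfidfVectorizer fitted in t2().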

4.Understanding csr_matrix

csr_matrix came up above, so this is a good place to review it.
csr_matrix (Compressed Sparse Row matrix) is one representation of a sparse matrix; its counterpart is csc_matrix (Compressed Sparse Column matrix).

CSR compresses the matrix row by row, representing the original matrix with three arrays:

def csr_data():
    from scipy import sparse
    import numpy as np
    data = np.array([1, 2, 3, 4, 5, 6])     # all non-zero values
    indices = np.array([0, 2, 2, 0, 1, 2])  # column index of each value
    indptr = np.array([0, 2, 3, 6])         # row i spans data[indptr[i]:indptr[i+1]]
    matrix = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
    print(matrix)
    print()
    print(matrix.todense())

csr_data()

The result:

  (0, 0)	1
  (0, 2)	2
  (1, 2)	3
  (2, 0)	4
  (2, 1)	5
  (2, 2)	6

[[1 0 2]
 [0 0 3]
 [4 5 6]]

Here, data holds all the non-zero values,
indices holds the column index of each non-zero value,
and indptr marks where each row's non-zero data starts and ends: row i owns data[indptr[i]:indptr[i+1]].
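
A short sketch to make the role of indptr concrete, reusing the arrays above (csr_row_slices is a name made up for this example):

def csr_row_slices():
    from scipy import sparse
    import numpy as np
    data = np.array([1, 2, 3, 4, 5, 6])
    indices = np.array([0, 2, 2, 0, 1, 2])
    indptr = np.array([0, 2, 3, 6])
    matrix = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
    for i in range(3):
        start, end = matrix.indptr[i], matrix.indptr[i + 1]
        # non-zero values of row i and their column indices
        print(i, matrix.data[start:end], matrix.indices[start:end])
    # 0 [1 2] [0 2]
    # 1 [3] [2]
    # 2 [4 5 6] [0 1 2]

csr_row_slices()

The column-oriented counterpart, csc_matrix, takes the same three arrays but interprets them column by column: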

def csc_data():
    from scipy import sparse
    import numpy as np
    data = np.array([1, 2, 3, 4, 5, 6])     # same arrays as the CSR example
    indices = np.array([0, 2, 2, 0, 1, 2])  # now the ROW index of each value
    indptr = np.array([0, 2, 3, 6])         # column j spans data[indptr[j]:indptr[j+1]]
    matrix = sparse.csc_matrix((data, indices, indptr), shape=(3, 3))
    print(matrix)
    print()
    print(matrix.todense())

csc_data()

The result:

  (0, 0)	1
  (2, 0)	2
  (2, 1)	3
  (0, 2)	4
  (1, 2)	5
  (2, 2)	6

[[1 0 4]
 [0 0 5]
 [2 3 6]]

The only difference between csc_matrix and csr_matrix is that csr's indptr delimits rows while csc's indptr delimits columns (and indices then holds row indices instead of column indices).
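
Converting between the two layouts is a single call; a quick sketch using scipy's tocsc converter:

def csr_to_csc():
    from scipy import sparse
    import numpy as np
    data = np.array([1, 2, 3, 4, 5, 6])
    indices = np.array([0, 2, 2, 0, 1, 2])
    indptr = np.array([0, 2, 3, 6])
    csr = sparse.csr_matrix((data, indices, indptr), shape=(3, 3))
    csc = csr.tocsc()  # same matrix, column-compressed storage
    # the dense views are identical; only the storage layout differs
    print((csr.todense() == csc.todense()).all())  # True

csr_to_csc()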
