sklearn: sklearn.feature_extraction.text.CountVectorizer

conquerorjia

Categories: sklearn, machine learning

Article link: https://blog.csdn.net/conquerorjia/article/details/24963177

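I have recently been working on text document classification with the algorithms that sklearn provides. Some of the functions and classes I ran into in other people's code were unclear to me, so I looked them up and decided to write a translation post covering the parts of the sklearn documentation I read. The translation may be rough, so corrections from more experienced readers are very welcome.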

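sklearn.feature_extraction.text.CountVectorizer is a class in the feature extraction module. What follows is roughly a translation of the original documentation, with my notes in parentheses: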

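Convert a collection of text documents to a matrix of token counts. (It turns a collection of text documents into a matrix of token counts, with one row per document and one column per token.)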

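This implementation produces a sparse representation of the counts using scipy.sparse.coo_matrix. (The token counts are stored as a SciPy sparse matrix. The quoted sentence comes from an older release of the documentation; current scikit-learn releases name scipy.sparse.csr_matrix here.)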
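As a rough illustration of the two sentences above, the sketch below fits a CountVectorizer on a tiny corpus and inspects the result. The two sentences in `docs` are made up for this example, and the exact sparse format returned may depend on the scikit-learn version, so the code only checks that the result is sparse.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, invented for this illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # rows = documents, columns = tokens

print(sparse.issparse(X))  # True: the counts are stored sparsely
print(X.shape)             # (2, 7): 2 documents, 7 distinct tokens
print(X.toarray())         # dense view, only sensible for small corpora
```

Calling `toarray()` is fine for a toy corpus like this one; for a real corpus the dense matrix is usually too large, which is why the sparse representation is used in the first place.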

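If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data. (If you neither supply a vocabulary in advance nor use an analyzer that performs some kind of feature selection, the number of features equals the size of the vocabulary found in the analyzed documents.)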
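The difference between a learned vocabulary and an a-priori one can be sketched as follows. The corpus and the word list passed to `vocabulary` are assumptions chosen for the example, not something taken from the original documentation.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, invented for this illustration.
docs = ["apple banana apple", "banana cherry"]

# No vocabulary supplied: it is learned from the data,
# so the number of columns equals the vocabulary size.
learned = CountVectorizer()
X_learned = learned.fit_transform(docs)
print(learned.vocabulary_)                              # token -> column index
print(X_learned.shape[1] == len(learned.vocabulary_))   # True

# A-priori vocabulary: the columns are fixed in advance,
# and tokens outside the list ("banana" here) are ignored.
fixed = CountVectorizer(vocabulary=["apple", "cherry"])
X_fixed = fixed.fit_transform(docs)
print(X_fixed.toarray())  # [[2 0]
                          #  [0 1]]
```

With the fixed vocabulary the matrix always has two columns, no matter which other words appear in the documents.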

Input parameters:

input : string {‘filename’, ‘file’, ‘content’}

If ‘filename’, the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze. (With ‘filename’, you pass a list of file names, and the vectorizer reads each file itself to get the raw content to analyze.)

If ‘file’, the sequence items must have a ‘read’ method (file-like object) that is called to fetch the bytes in memory. (With ‘file’, each item must be a file-like object that supports read, which is used to get the file's bytes.)

Otherwise the input is expected to be a sequence of strings or bytes items, which are analyzed directly. (With ‘content’, the items themselves are the strings or bytes to analyze.)
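The three settings of input described above can be compared with a minimal sketch like the one below. The sample text and the file name `doc.txt` are hypothetical; the file is written to a temporary directory only so that the example is self-contained.

```python
import io
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

text = "hello world hello"  # hypothetical sample document

# input='content' (the default): items are the strings or bytes themselves.
X_content = CountVectorizer(input="content").fit_transform([text])

# input='file': items are file-like objects exposing a read() method.
file_obj = io.BytesIO(text.encode("utf-8"))
X_file = CountVectorizer(input="file").fit_transform([file_obj])

# input='filename': items are paths that the vectorizer opens and reads itself.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "doc.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    X_filename = CountVectorizer(input="filename").fit_transform([path])

# All three routes should yield the same counts for the same text.
print(X_content.toarray())
print(X_file.toarray())
print(X_filename.toarray())
```

The only thing that changes between the three calls is how the raw text reaches the vectorizer; tokenization and counting are the same afterwards.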

encoding : string, ‘utf-8’ by default.