scikit-learn: What CountVectorizer does when extracting tf

mmc2015

Article link: https://blog.csdn.net/mmc2015/article/details/46866537

This article looks in detail at how CountVectorizer in scikit-learn extracts features from text, and in particular at how it computes term frequency (TF), an important step in text mining and natural language processing. CountVectorizer does this by tokenizing the text, building a vocabulary, and converting the documents into a term-count matrix.

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer

class sklearn.feature_extraction.text.CountVectorizer(input=u'content', encoding=u'utf-8', decode_error=u'strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=u'(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer=u'word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<type 'numpy.int64'>)

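Purpose: Convert a collection of text documents to a matrix of token counts (that is, count how many times each term occurs in each document, which is the raw tf). The documentation for this version describes the result as a sparse representation built with scipy.sparse.coo_matrix; recent releases document it as scipy.sparse.csr_matrix.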
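To make the "matrix of token counts" concrete, here is a minimal sketch with default settings. The two-sentence corpus is made up for illustration, and the result is converted to a dense array only so it can be inspected.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for this example.
docs = [
    "the cat sat on the mat",
    "the dog sat",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # learn the vocabulary and count tokens

# vocabulary_ maps each token to its column index in X.
vocab = vectorizer.vocabulary_

# X is a SciPy sparse matrix with one row per document
# and one column per vocabulary term.
counts = X.toarray()
the_counts = counts[:, vocab["the"]]

print(sparse.issparse(X), X.shape)
print(sorted(vocab))
print(the_counts)
```

Each row holds the raw counts for one document, so the column for "the" holds how often "the" occurs in the first and in the second sentence.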

Going through the parameters shows what CountVectorizer actually does when it extracts tf:

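strip_accents : {‘ascii’, ‘unicode’, None}: whether to remove accents (diacritical marks, as in "é") during preprocessing. ‘ascii’ is fast but only handles characters that have a direct ASCII mapping, ‘unicode’ is slower but works on any character, and None (the default) leaves the text unchanged. Not sure what accents are? See: http://textmechanic.com/?reqp=1&reqr=nzcdYz9hqaSbYaOvrt==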
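A small sketch of the effect of strip_accents, assuming a made-up one-line corpus. The accented letters are written as Unicode escapes so the example does not depend on file encoding.

```python
from sklearn.feature_extraction.text import CountVectorizer

# "caf\u00e9" is "cafe" with an accented final e; "r\u00e9sum\u00e9" likewise.
docs = ["caf\u00e9 cafe r\u00e9sum\u00e9 resume"]

# Default: accents are kept, so accented and plain spellings are distinct terms.
plain = CountVectorizer(strip_accents=None).fit(docs)

# 'unicode': accents are stripped before tokenizing, so the spellings merge.
stripped = CountVectorizer(strip_accents="unicode").fit(docs)

plain_vocab = sorted(plain.vocabulary_)
stripped_vocab = sorted(stripped.vocabulary_)
stripped_counts = stripped.transform(docs).toarray()[0].tolist()

print(len(plain_vocab), stripped_vocab, stripped_counts)
```

With accents stripped, the vocabulary shrinks and the counts of the merged spellings add up.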

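lowercase : boolean, True by default: convert all characters to lowercase before tokenizing, and therefore before tf is counted. This parameter is usually left at True.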
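The following sketch compares lowercase=True with lowercase=False on a made-up document containing one word in three casings.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Apple apple APPLE"]

# With lowercasing (the default) all three spellings become one term.
lower = CountVectorizer(lowercase=True)
lower_counts = lower.fit_transform(docs).toarray()[0].tolist()
lower_vocab = sorted(lower.vocabulary_)

# Without lowercasing each casing is its own term.
cased = CountVectorizer(lowercase=False)
cased_counts = cased.fit_transform(docs).toarray()[0].tolist()
cased_vocab = sorted(cased.vocabulary_)

print(lower_vocab, lower_counts)
print(cased_vocab, cased_counts)
```

Turning lowercasing off only makes sense when case carries meaning, since it splits the count of one word across several columns.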

preprocessor : callable or None (default): override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. You can supply your own function here.
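As an illustration of a custom preprocessor, here is a minimal sketch. The function drop_digits and the two sample strings are hypothetical, made up for this example. A custom callable replaces the built-in string transformation, so the function lowercases the text itself and does not rely on the lowercase parameter.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def drop_digits(text):
    """Hypothetical preprocessor: lowercase the text and blank out digits."""
    return re.sub(r"\d+", " ", text.lower())


docs = ["Order 66 executed", "order 99 pending"]

# The preprocessor runs on each raw document; tokenizing and
# n-gram generation still follow the vectorizer's own settings.
vectorizer = CountVectorizer(preprocessor=drop_digits)
X = vectorizer.fit_transform(docs)

vocab = sorted(vectorizer.vocabulary_)
order_counts = X.toarray()[:, vectorizer.vocabulary_["order"]].tolist()

print(vocab)
print(order_counts)
```

Without the preprocessor, the default token pattern would also keep "66" and "99" as terms, since it accepts any run of two or more word characters.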