sklearn.feature_extraction.text.TfidfVectorizer: usage notes for the TF-IDF text vectorization class

OrdinaryCrazy

class sklearn.feature_extraction.text.TfidfVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

Import: from sklearn.feature_extraction.text import TfidfVectorizer

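Converts a collection of raw documents into a matrix of TF-IDF features. This is equivalent to first vectorizing the text into term counts (CountVectorizer) and then applying the TF-IDF weighting (TfidfTransformer).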
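The equivalence described above can be checked with a minimal sketch. The toy corpus below is a made-up example, not part of the original notes, and default parameters are assumed for every class.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# One step: raw text -> TF-IDF matrix.
tfidf_direct = TfidfVectorizer().fit_transform(corpus)

# Two steps: raw text -> term counts -> TF-IDF weights.
counts = CountVectorizer().fit_transform(corpus)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# Both routes should give the same matrix, up to floating-point error.
same = np.allclose(tfidf_direct.toarray(), tfidf_two_step.toarray())
print(tfidf_direct.shape, same)
```

The one-step class is usually the more convenient choice; the two-step route is mainly useful when the raw counts are needed as well.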

Parameters:

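1. input : string {'filename', 'file', 'content'}

Describes what is passed to fit and transform: a list of file names to be read ('filename'), file-like objects that provide a read method ('file'), or the text itself as strings or bytes ('content').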

2. encoding : string, 'utf-8' by default.

The encoding of the input files; defaults to utf-8.

3. decode_error : {'strict', 'ignore', 'replace'}

4. strip_accents : {'ascii', 'unicode', None}

5. analyzer : string, {'word', 'char'} or callable

6. preprocessor : callable or None (default)

7. tokenizer : callable or None (default)

8. ngram_range : tuple (min_n, max_n)

9. stop_words : string {'english'}, list, or None (default)

10. lowercase : boolean, default True

11. token_pattern : string

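12. max_df : float in range [0.0, 1.0] or int, default=1.0

Upper bound on document frequency. Given an integer, terms that appear in more than that number of documents are ignored; given a float between 0 and 1, the value is the maximum proportion of documents that may contain a term. This parameter is ignored if a vocabulary is supplied.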

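13. min_df : float in range [0.0, 1.0] or int, default=1

Lower bound on document frequency. Given an integer, terms that appear in fewer than that number of documents are ignored; given a float between 0 and 1, the value is the minimum proportion of documents that must contain a term. This parameter is ignored if a vocabulary is supplied.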

14. max_features : int or None, default=None

15. vocabulary : Mapping or iterable, optional

16. binary : boolean, default=False

17. dtype : type, optional

18. norm : 'l1', 'l2' or None, optional

19. use_idf : boolean, default=True

20. smooth_idf : boolean, default=True

21. sublinear_tf : boolean, default=False
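As a rough illustration of how min_df, max_df and vocabulary interact, the sketch below fits three vectorizers on a made-up four-document corpus. The corpus and the variable names are assumptions chosen for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "apple" is in 4 documents, "banana" in 2,
# "cherry" and "durian" in 1 each.
corpus = [
    "apple banana cherry",
    "apple banana",
    "apple durian",
    "apple",
]

# No filtering: every term is kept.
full = TfidfVectorizer().fit(corpus)

# min_df=2 (int): drop terms found in fewer than 2 documents.
# max_df=0.75 (float): drop terms found in more than 75% of the documents.
filtered = TfidfVectorizer(min_df=2, max_df=0.75).fit(corpus)

# With an explicit vocabulary, min_df and max_df are ignored.
fixed = TfidfVectorizer(
    vocabulary=["apple", "cherry"], min_df=2, max_df=0.75
).fit(corpus)

full_terms = sorted(full.vocabulary_)
kept_terms = sorted(filtered.vocabulary_)
fixed_terms = sorted(fixed.vocabulary_)
print(full_terms)
print(kept_terms)
print(fixed_terms)
```

Here only "banana" survives the filtering: "apple" is too frequent and the other two terms are too rare, while the fixed vocabulary is kept as given.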

Methods:

1. build_analyzer()

2. build_preprocessor()

3. build_tokenizer()

4. decode(doc)

5. fit(raw_documents[, y])

6. fit_transform(raw_documents[, y])

7. get_feature_names() (superseded by get_feature_names_out() since scikit-learn 1.0 and removed in 1.2)

8. get_params([deep])

9. get_stop_words()

10. inverse_transform(X)

11. set_params(**params)

12. transform(raw_documents[, copy])
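A short sketch of how several of the methods listed above fit together in a typical workflow. The documents are invented for the example, and the feature-name lookup falls back to the older method name on the assumption that the installed scikit-learn may predate get_feature_names_out().

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training and unseen documents.
train_docs = ["the quick brown fox", "the lazy dog", "the quick dog"]
new_docs = ["a quick unseen fox"]

vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and IDF weights, then returns
# the TF-IDF matrix for the training documents.
X_train = vectorizer.fit_transform(train_docs)

# transform reuses the learned vocabulary; words not seen during
# fitting are silently dropped.
X_new = vectorizer.transform(new_docs)

# Column order of the matrix. Newer releases expose
# get_feature_names_out(); older ones only have get_feature_names().
get_names = (
    getattr(vectorizer, "get_feature_names_out", None)
    or vectorizer.get_feature_names
)
feature_names = list(get_names())

# build_analyzer returns the callable that lowercases, tokenizes and
# builds n-grams exactly as the vectorizer does internally.
analyzer = vectorizer.build_analyzer()
tokens = analyzer("The QUICK brown fox!")

# inverse_transform lists, per document, the terms with non-zero weight.
terms_per_doc = vectorizer.inverse_transform(X_new)

print(X_train.shape, X_new.shape)
print(feature_names)
print(tokens)
print(sorted(terms_per_doc[0]))
```

Calling fit once on the training data and transform on later data keeps the columns aligned between the two matrices.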
