sklearn Text Feature Extraction: Basic Usage of CountVectorizer


LLOJVQE

Category: Python Basics. Tags: python, strings, natural language processing

Article link: https://blog.csdn.net/weixin_41989712/article/details/107454577


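Text processing for machine learning in Python often requires counting word frequencies and dropping uninformative words during preprocessing, so CountVectorizer comes up frequently.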

class sklearn.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

At first glance there are a lot of parameters, but in practice only a few usually need to be customized.

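Parameter	Description
input	How the input is interpreted: 'content' (default) expects strings or bytes, 'filename' expects a sequence of file paths to read, 'file' expects file-like objects with a read method
encoding	Encoding used to decode bytes or files, default 'utf-8'
decode_error	What to do when a byte sequence is not valid in the given encoding: 'strict' raises an error, 'ignore' skips it, 'replace' substitutes it
strip_accents	{'ascii', 'unicode'}, removes accents during preprocessing, default None. Example: with 'ascii', 'â' becomes 'a'
lowercase	Convert all characters to lowercase before tokenizing, default True
preprocessor	Overrides the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. Only applies if `analyzer is not callable`
tokenizer	Overrides the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if `analyzer == 'word'`
stop_words	Removes stop words from the text (words such as and, him, is that usually carry little meaning for analysis). 'english' uses a built-in stop word list, a custom list can also be passed, and None removes nothing (the max_df/min_df parameters below can still drop terms by document frequency)
token_pattern	Regular expression that defines a token. The default selects runs of 2 or more word characters (letters, digits, underscore), so punctuation acts as a separator and single-character tokens are dropped. Only applies if `analyzer == 'word'`. Example: match runs of 3 to 5 characters made up of word characters, digits, whitespace, or @: `token_pattern='[\w\d\s@]{3,5}'`
ngram_range	Lower and upper bound on the n-gram length. For example, (1,2) extracts unigrams and bigrams. For the sentence `'I have a pen'`, splitting on whitespace gives unigrams `'I','have','a','pen'` and bigrams `'I have','have a','a pen'`. Note that the default token_pattern and lowercase settings would lowercase the text and drop 'I' and 'a' before the n-grams are built
analyzer	{'word', 'char', 'char_wb'} or a callable. Chooses whether features are word n-grams or character n-grams ('char_wb' builds character n-grams only inside word boundaries). A callable is used to extract the sequence of features from the raw input
max_df	Either a float in [0.0, 1.0] or an int, default 1.0. It acts as a threshold when building the vocabulary: a term whose document frequency is strictly higher than max_df is left out (terms that are too frequent are treated as uninformative). A float is read as a proportion of documents, an int as an absolute number of documents. Ignored if vocabulary is given
min_df	Same as max_df but as a lower threshold: terms whose document frequency is strictly lower than min_df are left out, default 1
max_features	Default None. When set to a number, only the top max_features terms ordered by term frequency across the corpus are kept. Ignored if vocabulary is given
vocabulary	Default None. A mapping or an iterable that fixes the features in advance. As a mapping, keys are terms and values are column indices, and the indices must run from 0 to the largest index with no repeats or gaps. If not given, the vocabulary is built from the input documents
binary	Default False. If True, all nonzero counts are set to 1, which is useful for discrete probabilistic models that expect binary events rather than counts
dtype	Numeric type of the document-term matrix returned by `fit_transform()` or `transform()`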
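To make the interaction between ngram_range and token_pattern concrete, here is a minimal sketch using the sentence from the table. The looser pattern `r"(?u)\b\w+\b"` is an assumption chosen for this demo so that single-character tokens survive; it is not a recommended setting.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I have a pen"]

# Default token_pattern keeps only tokens of 2+ word characters,
# so "i" and "a" are dropped before the n-grams are built.
default_cv = CountVectorizer(ngram_range=(1, 2))
default_cv.fit(docs)
default_terms = sorted(default_cv.vocabulary_)

# Demo-only pattern (an assumption) that also keeps single-character tokens.
loose_cv = CountVectorizer(ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b")
loose_cv.fit(docs)
loose_terms = sorted(loose_cv.vocabulary_)

print(default_terms)  # ['have', 'have pen', 'pen']
print(loose_terms)    # ['a', 'a pen', 'have', 'have a', 'i', 'i have', 'pen']
```

With the default pattern the bigram 'have pen' appears because the dropped tokens no longer sit between 'have' and 'pen'.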

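CountVectorizer has three attributes:

vocabulary_: dict
A mapping from terms to feature indices. key: the term, value: its column index (note that the value is not the number of times the term occurs; indices follow the sorted order of the terms, not the order in which they appear in the text)
fixed_vocabulary_: boolean
True if a fixed vocabulary of term to indices mapping is provided by the user
stop_words_: set
The terms that were ignored. A term can be ignored because:

it occurred in too many documents (max_df)
it occurred in too few documents (min_df)
it was cut off by feature selection (max_features)
This attribute is only available when no vocabulary was given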
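The sketch below shows how min_df, max_df, and a user-supplied vocabulary change what ends up in vocabulary_ and fixed_vocabulary_. The three sentences and the thresholds are illustrative choices, not values from any particular project.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love cat", "she loves dog", "I have a cat and a cat"]

# min_df=2 (int): keep only terms that appear in at least 2 documents.
rare_removed = CountVectorizer(min_df=2).fit(texts)
print(rare_removed.vocabulary_)        # {'cat': 0}

# max_df=0.5 (float): drop terms that appear in more than 50% of documents.
# "cat" is in 2 of 3 documents, so it is removed.
common_removed = CountVectorizer(max_df=0.5).fit(texts)
kept_terms = sorted(common_removed.vocabulary_)
print(kept_terms)                      # ['and', 'dog', 'have', 'love', 'loves', 'she']

# A user-supplied vocabulary fixes the columns in advance.
fixed = CountVectorizer(vocabulary={"cat": 0, "dog": 1})
fixed_counts = fixed.fit_transform(texts).toarray()
print(fixed.fixed_vocabulary_)         # True
print(fixed_counts.tolist())           # [[1, 0], [0, 1], [2, 0]]
```

Both thresholds count documents, not occurrences: "cat" occurs three times overall but its document frequency is 2.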

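Some commonly used CountVectorizer methods:
get_feature_names(): returns the vocabulary as a list. This method was deprecated in scikit-learn 1.0 and removed in 1.2; use get_feature_names_out(), which returns an array of the terms in column order
get_stop_words(): returns the effective stop words (a frozenset, or None if stop_words was not set)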

Example

from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love cat", "she loves dog", "I have a cat and a cat"]  # each string is one input document

cv = CountVectorizer()  # create the bag-of-words vectorizer
cv_fit=cv.fit_transform(texts)

print(cv.vocabulary_)
# {'love': 4, 'cat': 1, 'she': 6, 'loves': 5, 'dog': 2, 'have': 3, 'and': 0}
print(cv_fit)
'''
  (0, 1)	1
  (0, 4)	1
  (1, 2)	1
  (1, 5)	1
  (1, 6)	1
  (2, 0)	1
  (2, 3)	1
  (2, 1)	2
'''
print(cv_fit.toarray())
'''
[[0 1 0 0 1 0 0]
 [0 0 1 0 0 1 1]
 [1 2 0 1 0 0 0]]
 '''
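Once fitted, the same vectorizer can encode documents it has not seen before. This is a minimal sketch of that step, also showing the effect of binary=True; the new sentence is an invented example.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["I love cat", "she loves dog", "I have a cat and a cat"]
new_docs = ["cat and dog and bird"]  # "bird" was never seen during fit

# Columns follow the fitted vocabulary: and, cat, dog, have, love, loves, she
cv = CountVectorizer().fit(texts)
counts = cv.transform(new_docs).toarray()

# binary=True records presence (1) or absence (0) instead of counts.
binary_cv = CountVectorizer(binary=True).fit(texts)
flags = binary_cv.transform(new_docs).toarray()

print(counts.tolist())  # [[2, 1, 1, 0, 0, 0, 0]]
print(flags.tolist())   # [[1, 1, 1, 0, 0, 0, 0]]
```

Terms outside the fitted vocabulary, such as "bird" here, are silently ignored by transform().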

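Official documentation: the scikit-learn API reference page for sklearn.feature_extraction.text.CountVectorizer.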
