CountVectorizer vs. TfidfVectorizer: a comparison

Latest recommended article published 2023-04-23 22:18:23

weixin_43351935

Category: pandas. Tags: sklearn, python

Article link: https://blog.csdn.net/weixin_43351935/article/details/128801759

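CountVectorizer and TfidfVectorizer are two common ways to turn text into numeric features.
For each training document, CountVectorizer only counts how often each term occurs in that document.
TfidfVectorizer also uses the frequency of a term in the current document,
but additionally weights it by the inverse of the number of training documents that contain the term (the inverse document frequency). In comparison, the more training documents there are, the more this weighting tends to pay off, because terms that appear in most documents carry little discriminative information and are down-weighted.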
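
As a rough sketch of what the inverse-document-frequency weighting does, the snippet below recomputes TF-IDF by hand from raw counts and compares the result with TfidfVectorizer. It assumes scikit-learn's default settings (smooth_idf=True, norm='l2', sublinear_tf=False), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1. The names docs, manual and reference are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Illustrative corpus for this sketch
docs = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

counts = CountVectorizer().fit_transform(docs).toarray()  # raw term counts
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed IDF, assuming scikit-learn's default smooth_idf=True
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts by IDF, then L2-normalise each row (default norm='l2')
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(manual, reference))
```

If the defaults are changed (for example smooth_idf=False or norm=None), the hand computation would need to be adjusted accordingly.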

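from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
# Create the bag-of-words vectorizer and set its parameters
vectorizer = CountVectorizer(min_df=1, ngram_range=(1, 1))
X = vectorizer.fit_transform(corpus)  # fit the model and return the document-term matrix (sparse)
print(vectorizer.get_feature_names_out())  # all vocabulary terms, as a NumPy array
print(vectorizer.vocabulary_)  # vocabulary dict: key = term, value = column index
print(X.toarray())  # .toarray() converts the sparse matrix to a dense array

print(X)  # the same matrix in sparse form
# An entry such as (0, 8)	1 means (row, column) with a term count of 1.
# To find which term a column stands for, look it up in the vocabulary:
# column 8 is 'this', the key whose value is 8 in vectorizer.vocabulary_.
print(X.toarray().sum(axis=0))  # total count of each term across all documents
# The steps above produce the sparse matrix of term counts.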

See also: https://blog.csdn.net/blmoistawinde/article/details/80816179
sklearn: TfidfVectorizer, Chinese text processing and some usage parameters

Using TfidfVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
tfidf = TfidfVectorizer(min_df=1, ngram_range=(1, 1))
features = tfidf.fit_transform(corpus)  # sparse TF-IDF matrix
# Show the weights as a table with one column per term
pd.DataFrame(
    features.toarray(),
    columns=tfidf.get_feature_names_out()
)


from sklearn.feature_extraction.text import TfidfVectorizer

# Reuses the corpus defined above
vectorizer = TfidfVectorizer(min_df=1, ngram_range=(1, 1))  # set parameters here
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names_out()  # vocabulary terms
X.toarray()  # dense TF-IDF matrix


Reference links for further study

Topic Modeling with Deep Learning Using Python BERTopic
https://medium.com/grabngoinfo/topic-modeling-with-deep-learning-using-python-bertopic-cf91f5676504

A Friendly Introduction to Text Clustering
https://towardsdatascience.com/a-friendly-introduction-to-text-clustering-fa996bcefd04

Clustering Product Names with Python, Part 1
https://towardsdatascience.com/clustering-product-names-with-python-part-1-f9418f8705c8

Clustering Product Names with Python, Part 2
https://towardsdatascience.com/clustering-product-names-with-python-part-2-648cc54ca2ac

https://github.com/Derekkk/Brown-Word-Clustering-and-word-similarity

http://brandonrose.org/clustering

Text Analytics for Beginners using Python spaCy Part-2
https://machinelearninggeek.com/text-analytics-for-beginners-using-python-spacy-part-2/

Supervised text classification
https://machinelearninggeek.com/text-classification-using-python-spacy/
