TF-IDF: Principles and Practice

ithinking110

Category: nlp

Article link: https://blog.csdn.net/ithinking110/article/details/105045651

TF (Term Frequency)

TF is simply the term frequency: the number of times a word appears in a document.

$$\mathrm{TF}(t, d) = \text{number of times term } t \text{ appears in document } d$$
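As a minimal sketch of the definition above, term frequency can be computed by counting the tokens of a single document. The function name `term_frequency` and the whitespace tokenization are assumptions made for illustration; in practice the count is also often divided by the document length.

```python
from collections import Counter


def term_frequency(tokens):
    """Return a dict mapping each word to its raw count in one document."""
    return dict(Counter(tokens))


# One document from the sample corpus, lowercased and split on whitespace.
doc = "system and human system engineering testing of eps".split()
tf = term_frequency(doc)

print(tf["system"])  # "system" occurs twice in this document
print(tf["eps"])     # "eps" occurs once
```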

Inverse Document Frequency (IDF)

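$$\mathrm{IDF}(t) = \log\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing } t + 1}\right)$$

The more common a word is, the larger the denominator, and the smaller the inverse document frequency becomes, approaching 0. The 1 is added to the denominator to avoid a zero denominator (the case where no document contains the word). log means taking the logarithm of the resulting value.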

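In statistical terms, on top of the term frequency each word is assigned an "importance" weight: the more common a word is, the smaller its weight, and the rarer a word is, the larger its weight. My own reading is that this is reasonable, because a word that shows up in every document contributes little to the topic of any particular document.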
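The weighting described above can be sketched in a few lines. This assumes the natural logarithm (the text does not specify a base) and the variant with 1 added to the denominator; the function name `inverse_document_frequency` and the small corpus are hypothetical.

```python
import math


def inverse_document_frequency(word, documents):
    """IDF = log(N / (1 + number of documents containing the word))."""
    containing = sum(1 for tokens in documents if word in tokens)
    return math.log(len(documents) / (1 + containing))


# Four tokenized documents taken from the sample corpus.
docs = [
    "the eps user interface management system".split(),
    "the generation of random binary unordered trees".split(),
    "graph minors a survey".split(),
    "the intersection graph of paths in trees".split(),
]

idf_the = inverse_document_frequency("the", docs)        # appears in 3 of 4 documents
idf_survey = inverse_document_frequency("survey", docs)  # appears in 1 of 4 documents

# The common word gets the smaller weight, the rare word the larger one.
print(idf_the, idf_survey)
```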

TF-IDF

TF-IDF = TF * IDF

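As you can see, TF-IDF increases with the number of times a word appears in a document and decreases as the word becomes more common across the whole corpus. So the algorithm for automatic keyword extraction is straightforward: compute the TF-IDF value of every word in the document, sort the words in descending order, and take the top few.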
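The keyword-extraction procedure just described (score every word, sort in descending order, take the top few) might look like the sketch below. It assumes raw counts for TF, a natural-log IDF with 1 added to the denominator, and alphabetical order to break ties; the function name `tfidf_keywords` and the `top_k` parameter are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(doc_index, documents, top_k=5):
    """Rank the words of one document by TF * IDF, highest first."""
    n_docs = len(documents)
    counts = Counter(documents[doc_index])
    scores = {}
    for word, count in counts.items():
        containing = sum(1 for tokens in documents if word in tokens)
        idf = math.log(n_docs / (1 + containing))
        scores[word] = count * idf
    # Sort by descending score; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


docs = [
    "system and human system engineering testing of eps".split(),
    "the eps user interface management system".split(),
    "graph minors a survey".split(),
    "the intersection graph of paths in trees".split(),
]

top = tfidf_keywords(0, docs)
for word, score in top:
    print(word, round(score, 3))
```

On a corpus this small the ranking is noisy: a function word such as "and" scores as high as "engineering" simply because it appears in only one document.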

Code implementation

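The comments are included in the code. You need to install the gensim package first.
The code computes the TF-IDF of every word in every document, in preparation for later tasks such as working out the topic of each document.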


from gensim import corpora, models

# Put the whole corpus into a list; each comma-separated string is one document
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# Tokenize each document into a list of words: [[word1, word2], [word3, word4]]
texts = [document.lower().split() for document in documents]

print("texts==",texts)

# Dictionary: maps every word in the corpus to an integer id
# dict= (0, 'abc')(1, 'applications')(2, 'computer')(3, 'for')

dictionary = corpora.Dictionary(texts)


for k in dictionary.items():
    print("dict=", k)
# Bag-of-words corpus: each document is stored as a list of (word id, count) pairs
# (0, 1), (1, 1), (2, 1), (3, 1)
corpus = [dictionary.doc2bow(text) for text in texts]

print("corpus==",corpus)

# Fit the TF-IDF model on the corpus
tfidf = models.TfidfModel(corpus)

# Compute the TF-IDF of every word in every document
corpus_tfidf = tfidf[corpus]

# (0, 0.39510679503439006), (1, 0.39510679503439006), (2, 0.270464478621662),
for doc in corpus_tfidf:
    print("doc",doc)
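The values gensim prints do not come from the plain TF * IDF formula given earlier. Assuming gensim's default `TfidfModel` settings (raw counts as TF, IDF = log2(N / document frequency) with nothing added to the denominator, and each document vector scaled to unit length), the sample output can be reproduced in pure Python. The helper name `gensim_style_tfidf` is hypothetical; the sketch is only meant to show where numbers such as 0.395 come from.

```python
import math
from collections import Counter

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

texts = [document.lower().split() for document in documents]
n_docs = len(texts)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for text in texts for word in set(text))


def gensim_style_tfidf(tokens):
    """Assumed gensim defaults: count * log2(N / df), then L2 normalization."""
    counts = Counter(tokens)
    weights = {word: count * math.log2(n_docs / doc_freq[word])
               for word, count in counts.items()}
    norm = math.sqrt(sum(weight * weight for weight in weights.values()))
    if norm == 0:
        return {}
    return {word: weight / norm for word, weight in weights.items()}


first_doc = gensim_style_tfidf(texts[0])
print(first_doc["abc"])       # compare with the weight of id 0 in the sample output
print(first_doc["computer"])  # compare with the weight of id 2 in the sample output
```

Because every vector is normalized, the weights are only comparable within a document, which is enough for ranking that document's keywords.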
