Understanding TF-IDF

Latest recommended article published on 2024-07-02 09:36:33

jrymos001

Categories: Machine Learning, Natural Language Processing. Tags: TF-IDF

Article link: https://blog.csdn.net/m0_37681914/article/details/73781494

What is TF-IDF

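TF-IDF is a statistical measure of how important a word is to one document within a document collection or corpus. A word's importance grows in proportion to the number of times it appears in that document, and shrinks the more widely the word appears across the corpus.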
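
The definition can be written compactly as a formula. The symbols below are my own notation, not the original post's: n_{t,d} is the number of times term t occurs in document d, N is the number of documents, and df_t is the number of documents containing t. The base of the logarithm is a convention; the worked example in this post uses the natural logarithm.

```latex
\[
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}},
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}_t}
\]
```

A term that occurs in every document has df_t = N, so its idf, and therefore its TF-IDF weight, is 0.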

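Suppose we have the following document collection:

Document 1: Human machine
Document 2: System human
---------------------------------
The corpus (number of times each word occurs in each document) is then:
Word \ Document   Doc 1     Doc 2    Total
human             1 (a)     1        2 (c)
machine           1         0        1
system            0         1        1
Total words       2 (b)     2        number of documents = 2 (d)
------------------------------
What does TF-IDF compute? For example, the weight of "human" in Document 1.
TF-IDF formula: term frequency in the document * inverse document frequency
For "human" in Document 1, TF-IDF = 0.5 * 0 = 0, where:
term frequency = a/b = 1/2 = 0.5
inverse document frequency = log(d/c) = log(2/2) = log(1) = 0
Similarly, for "machine" in Document 1, TF-IDF = 0.5 * log(2/1) = 0.5 * 0.693 = 0.35 (natural logarithm)
Note: c is the number of documents that contain the word. It equals the row total here only because no word occurs more than once in a document.
-------------------------------
TF-IDF for this document collection:
Word \ Document   Doc 1     Doc 2
human             0         0
machine           0.35      0
system            0         0.35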
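
The hand calculation above can be checked with a few lines of plain Python. This is a minimal sketch under the same assumptions as the example (term frequency is count divided by document length, idf uses the natural logarithm); the helper names `tf`, `idf` and `tfidf_table` are made up for illustration and do not come from any library.

```python
import math

# Toy document collection from the example, already lower-cased and tokenized.
docs = {
    "doc1": ["human", "machine"],
    "doc2": ["system", "human"],
}

def tf(word, tokens):
    """Term frequency: occurrences of the word divided by document length (a / b)."""
    return tokens.count(word) / len(tokens)

def idf(word, docs):
    """Inverse document frequency: ln(number of documents / documents containing the word)."""
    containing = sum(1 for tokens in docs.values() if word in tokens)
    return math.log(len(docs) / containing)

def tfidf_table(docs):
    """Return {word: {document name: tf-idf weight}} for every word in the collection."""
    vocab = sorted({w for tokens in docs.values() for w in tokens})
    return {
        w: {name: tf(w, tokens) * idf(w, docs) for name, tokens in docs.items()}
        for w in vocab
    }

table = tfidf_table(docs)
for word, row in table.items():
    print(word, {name: round(value, 2) for name, value in row.items()})
# human {'doc1': 0.0, 'doc2': 0.0}
# machine {'doc1': 0.35, 'doc2': 0.0}
# system {'doc1': 0.0, 'doc2': 0.35}
```

Because "human" appears in both documents, its idf is ln(2/2) = 0 and it carries no weight in either document.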

The following code uses gensim to build a TF-IDF model from a document collection:

For an introduction to gensim corpora, see
http://blog.csdn.net/m0_37681914/article/details/73744685

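from collections import defaultdict

from gensim import corpora, models

# Document collection: each string is one document
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
# Lower-case, split on whitespace, and drop common stop words
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
# Drop words that occur only once in the whole collection
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 1]
         for text in texts]
# Map each word to an integer id and turn each document into a bag-of-words vector
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# ------------ Everything above only builds the corpus ------------
# ------------ The TF-IDF part starts here ------------
# Fit TF-IDF on the document collection; the resulting model is a fixed
# weighting scheme that can then be applied to any bag-of-words document
tfidf = models.TfidfModel(corpus)
# Apply the model to a document doc_bow (bag-of-words vector) to get its TF-IDF weights
doc_bow = [(0, 1), (1, 1)]
print(tfidf[doc_bow])
# Apply the transformation to the whole corpus.
# With its default settings gensim uses a base-2 log and scales each vector to
# unit length, so the numbers differ from the hand calculation earlier in this post.
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)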
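
To see why gensim's output does not match the 0.35 from the hand calculation, here is a rough pure-Python sketch of the weighting I understand `TfidfModel` to apply by default: raw count as term frequency, log base 2 of (number of documents / document frequency) as idf, zero weights dropped, and each vector scaled to unit length. This is my reading of the defaults, not gensim's actual code, and the function name `tfidf_like_gensim_defaults` is hypothetical.

```python
import math

def tfidf_like_gensim_defaults(corpus):
    """corpus: list of bag-of-words documents, each a list of (token_id, count) pairs.

    Assumes each token id appears at most once per document, as doc2bow produces.
    """
    num_docs = len(corpus)
    # Document frequency: in how many documents each token id appears.
    doc_freq = {}
    for bow in corpus:
        for token_id, _ in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    result = []
    for bow in corpus:
        # Raw count times log2(N / df).
        weights = [(tid, count * math.log2(num_docs / doc_freq[tid]))
                   for tid, count in bow]
        # Terms whose weight is zero are dropped from the vector.
        weights = [(tid, w) for tid, w in weights if abs(w) > 1e-12]
        # Scale the remaining weights to unit Euclidean length.
        norm = math.sqrt(sum(w * w for _, w in weights))
        result.append([(tid, w / norm) for tid, w in weights] if norm else [])
    return result

# The two-document toy example, with ids 0 = human, 1 = machine, 2 = system.
toy_corpus = [[(0, 1), (1, 1)], [(0, 1), (2, 1)]]
weights = tfidf_like_gensim_defaults(toy_corpus)
print(weights)  # [[(1, 1.0)], [(2, 1.0)]]
```

Under these assumptions "human" disappears from both vectors (idf 0), and the single remaining term in each document is normalized to 1.0 rather than 0.35.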