How gensim computes TF-IDF

RessCris

Article link: https://blog.csdn.net/weixin_41783424/article/details/122407815

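The basic idea is that the importance of a term in a document
increases with TF: the frequency of the term within that document
decreases with the fraction of documents in the corpus that contain the term; IDF (inverse document frequency) is the log of the inverse of that fraction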

There are many variants of the actual computation, though, so the rest of this post walks through how gensim does it, using an example.
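As a compact summary of the variant traced in the rest of this post (gensim's default settings: raw term counts, a base-2 logarithm, and L2 normalization), the weighting can be written as follows. The symbols are my own notation, not gensim's: $tf_{t,d}$ is the count of term $t$ in document $d$, $D$ is the total number of documents, and $df_t$ is the number of documents containing $t$.

```latex
w_{t,d} = tf_{t,d} \cdot \log_2 \frac{D}{df_t},
\qquad
\hat{w}_{t,d} = \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^{2}}}
```

Here $\hat{w}_{t,d}$ is the value that ends up in the output vector.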

Usage

from gensim import models  # only `models` is needed for this example

corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]

tfidf = models.TfidfModel(corpus)
vec = [(0, 1),(4,1)]

print(tfidf[vec])
# Output: [(0, 0.8075244024440723), (4, 0.5898341626740045)]

Steps

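1. Count how often each term occurs (term frequency)
2. Compute the idf value
In the gensim source, the function df2idf is the one that computes the idf value.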

import numpy as np  # df2idf below relies on NumPy

def df2idf(docfreq, totaldocs, log_base=2.0, add=0.0):
    r"""Compute inverse-document-frequency for a term with the given document frequency `docfreq`:
    :math:`idf = add + log_{log\_base} \frac{totaldocs}{docfreq}`

    Parameters
    ----------
    docfreq : {int, float}
        Document frequency.
    totaldocs : int
        Total number of documents.
    log_base : float, optional
        Base of logarithm.
    add : float, optional
        Offset.

    Returns
    -------
    float
        Inverse document frequency.

    """
    return add + np.log(float(totaldocs) / docfreq) / np.log(log_base)
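To make steps 1 and 2 concrete, here is a minimal standard-library sketch that counts document frequencies over the example corpus and then applies the same formula as df2idf. The helper names `document_frequencies` and `idf` are hypothetical and are not part of gensim; the sketch only mirrors the arithmetic, not gensim's internals.

```python
import math
from collections import Counter

# Example corpus from this post: each document is a list of (term_id, count) pairs.
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]


def document_frequencies(corpus):
    """Hypothetical helper: number of documents containing each term id."""
    dfs = Counter()
    for doc in corpus:
        # A term counts once per document, whatever its in-document count is.
        dfs.update({term_id for term_id, _ in doc})
    return dfs


def idf(docfreq, totaldocs, log_base=2.0, add=0.0):
    """Same formula as df2idf, written with the math module instead of NumPy."""
    return add + math.log(totaldocs / docfreq) / math.log(log_base)


dfs = document_frequencies(corpus)
idfs = {term_id: idf(df, len(corpus)) for term_id, df in dfs.items()}

print(dfs[0], dfs[4])    # document frequencies of terms 0 and 4
print(idfs[0], idfs[4])  # their idf values
```

Note that term 4 appears twice in the fourth document, yet it still contributes only 1 to the document frequency.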

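3. Normalize
By default the weighted vector is scaled to unit L2 length. In IPython or Jupyter, run ??gensim.matutils.unitvec to view the source (the module is matutils, not mathutils).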
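The normalization step itself is small. The following is a minimal sketch of L2 normalization for a sparse vector of (term_id, weight) pairs; `unit_l2` is a hypothetical name used for illustration, and it is a simplified stand-in for what unitvec does, not a copy of gensim's implementation.

```python
import math


def unit_l2(vec):
    """Hypothetical helper: scale a sparse (term_id, weight) vector to unit L2 length."""
    length = math.sqrt(sum(weight * weight for _, weight in vec))
    if length == 0.0:
        # An all-zero (or empty) vector cannot be scaled; return it unchanged.
        return list(vec)
    return [(term_id, weight / length) for term_id, weight in vec]


scaled = unit_l2([(0, 3.0), (1, 4.0)])
print(scaled)  # [(0, 0.6), (1, 0.8)]
```

After scaling, the squared weights sum to 1, which is why the output of the TF-IDF model is comparable across documents of different lengths.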

Plugging in the numbers

Let's walk through the example from the start of the post.


step 1:
	termfreq = [(0, 1), (4, 1)]
step 2:
	term 0: totaldocs = 9, docfreq = 2, log_base = 2
	idf = np.log(float(totaldocs) / docfreq) / np.log(log_base) = 2.1699250014423126
	term 4: totaldocs = 9, docfreq = 3, log_base = 2
	idf = np.log(float(totaldocs) / docfreq) / np.log(log_base) = 1.5849625007211563
	idf = [(0, 2.1699250014423126), (4, 1.5849625007211563)]
	Both term frequencies are 1, so tf * idf is equal to the idf values above.
step 3:
	L2 normalization: veclen = np.sqrt(np.sum(np.array([2.1699250014423126, 1.5849625007211563]) ** 2)) = 2.6871324196207156
	Dividing each weight by veclen gives tfidf = [(0, 0.8075244024440723), (4, 0.5898341626740045)]
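The three steps can also be chained in a short standard-library sketch that reproduces the numbers above without gensim. `tfidf_transform` is a hypothetical function written for this walkthrough. It assumes gensim's defaults (raw counts, log base 2, L2 normalization) and, as a simplification, skips any term that never occurs in the corpus.

```python
import math

# Example corpus from this post: each document is a list of (term_id, count) pairs.
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]


def tfidf_transform(corpus, vec, log_base=2.0):
    """Hypothetical end-to-end sketch: tf * idf followed by L2 normalization."""
    total_docs = len(corpus)

    # Document frequency: in how many documents each term id appears.
    docfreq = {}
    for doc in corpus:
        for term_id in {t for t, _ in doc}:
            docfreq[term_id] = docfreq.get(term_id, 0) + 1

    # Weight each term by tf * idf; terms unseen in the corpus are skipped
    # (a simplification made for this sketch).
    weighted = [
        (t, tf * math.log(total_docs / docfreq[t]) / math.log(log_base))
        for t, tf in vec
        if t in docfreq
    ]

    # Scale the weighted vector to unit L2 length.
    length = math.sqrt(sum(w * w for _, w in weighted))
    if length == 0.0:
        return weighted
    return [(t, w / length) for t, w in weighted]


result = tfidf_transform(corpus, [(0, 1), (4, 1)])
print(result)  # should agree with the tfidf values above up to floating-point rounding
```

Passing a vector such as [(0, 1), (4, 2)] shows the effect of the term frequency: the weight of term 4 is doubled before normalization, so its share of the unit-length vector grows.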