<tf-idf + cosine similarity> Computing the similarity of articles

weixin_30723433

Original link: http://www.cnblogs.com/wxiaoli/p/6940702.html

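Background:

(1) tf-idf

The guiding idea behind using a word's TF-IDF value to measure its importance in a document: if a word is rare overall but appears many times in this article, it very likely reflects what is distinctive about the article, and is exactly the kind of keyword we want.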

tf–idf is the product of two statistics, term frequency and inverse document frequency.

Various ways of determining the exact values of both statistics exist.

tf–idf = tf × idf

In the case of the term frequency tf(t,d), the simplest choice is to use the raw frequency of a term in a document, i.e. the number of times that term t occurs in document d.

Other possibilities include:

- Boolean "frequencies": tf(t,d) = 1 if t occurs in d and 0 otherwise;

- logarithmically scaled frequency: $\mathrm{tf}(t,d) = 1 + \log f_{t,d}$, or zero if $f_{t,d}$ is zero (here $f_{t,d}$ is the raw count of term t in document d);

- augmented frequency, to prevent a bias towards longer documents, e.g. raw frequency divided by the maximum raw frequency of any term in the document:

$\mathrm{tf}(t,d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max\{f_{t',d} : t' \in d\}}$
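
The term-frequency variants listed above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and each function takes one document that has already been split into a list of tokens.

```python
import math
from collections import Counter

def raw_tf(tokens):
    """Raw count f(t, d) of every term in one tokenized document."""
    return Counter(tokens)

def boolean_tf(tokens):
    """1 for every term that occurs; absent terms are simply missing (0)."""
    return {t: 1 for t in set(tokens)}

def log_tf(tokens):
    """1 + log f(t, d) for terms that occur; absent terms count as 0."""
    return {t: 1 + math.log(f) for t, f in Counter(tokens).items()}

def augmented_tf(tokens):
    """0.5 + 0.5 * f(t, d) / (largest raw count in the document)."""
    counts = Counter(tokens)
    if not counts:  # an empty document has no terms to score
        return {}
    max_f = max(counts.values())
    return {t: 0.5 + 0.5 * f / max_f for t, f in counts.items()}
```

For the token list `["a", "b", "a", "c"]`, `augmented_tf` gives 1.0 for `"a"` (the most frequent term) and 0.75 for `"b"` and `"c"`.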

The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents.
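
The paragraph above does not give a formula for idf. One common choice, and the one the worked example later in this article appears to use, is the logarithm of the total number of documents divided by the number of documents that contain the term. The sketch below assumes that definition; the function name `idf` and the default base-2 logarithm are choices made for this illustration.

```python
import math

def idf(term, documents, log=math.log2):
    """Inverse document frequency: log(N / number of documents containing term).

    `documents` is a list of token collections (lists or sets).
    """
    n_containing = sum(1 for doc in documents if term in doc)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in any document")
    return log(len(documents) / n_containing)

docs = [{"a", "b"}, {"a"}]
print(idf("a", docs))  # 0.0: "a" is in every document, so it carries no information
print(idf("b", docs))  # 1.0: "b" is in one of the two documents
```

A term that appears in every document gets idf 0, which is why very common words drop out of the keyword ranking.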

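(2) Cosine similarity

The cosine value lies in [-1, 1]. The closer it is to 1, the closer the directions of the two vectors; the closer it is to -1, the more opposite their directions; a value near 0 means the two vectors are nearly orthogonal.

Similarity scores are usually normalized to the interval [0, 1], so the cosine similarity is expressed as cosineSIM = 0.5cosθ + 0.5. (Term-frequency vectors have no negative components, so their cosine already lies in [0, 1], and cosineSIM then falls in [0.5, 1].)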
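
As a minimal sketch of the two quantities just described, the functions below compute cosθ for two equal-length vectors and then the normalized score 0.5cosθ + 0.5. The names `cosine` and `cosine_sim` are chosen for this illustration only.

```python
import math

def cosine(u, v):
    """cos(theta) between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def cosine_sim(u, v):
    """Cosine rescaled from [-1, 1] to [0, 1]."""
    return 0.5 * cosine(u, v) + 0.5

print(cosine([1, 0], [0, 1]))       # 0.0  (orthogonal)
print(cosine([1, 0], [-1, 0]))      # -1.0 (opposite directions)
print(cosine_sim([1, 0], [-1, 0]))  # 0.0
```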

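Procedure:

(1) Use the TF-IDF algorithm to find the keywords of the two articles;

(2) Take several keywords from each article (for fairness, usually the same number from each), merge them into one set, and compute each article's term frequency for every word in that set

(Note 1: to avoid the effect of differing article lengths, relative term frequencies can be used. Note 2: the number of distinct words selected in this step determines the length of the term-frequency vectors);

(3) Build a term-frequency vector for each article (note: the vectors of all articles have the same length, and the element at a given position corresponds to the same word in every vector);

(4) Compute the cosine similarity of the two vectors; the larger the value, the more similar the articles.

Note: the tf-idf values are used only in step (1).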
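
The four steps above can be put together roughly as follows. This is a sketch under stated assumptions, not a reference implementation: the documents are already tokenized, the corpus consists of just the two articles being compared, idf is a base-2 logarithm of (number of documents / documents containing the word), and ties between equally ranked keywords are broken by order of first appearance. The names `top_keywords` and `similarity` are hypothetical.

```python
import math
from collections import Counter

def top_keywords(tokens, corpus, k):
    """Step (1): the k words of one document with the highest tf-idf."""
    counts = Counter(tokens)

    def tfidf(term):
        df = sum(1 for doc in corpus if term in doc)
        return counts[term] * math.log2(len(corpus) / df)

    # sorted() is stable, so tied words keep their first-appearance order.
    return sorted(counts, key=tfidf, reverse=True)[:k]

def similarity(tokens_a, tokens_b, k=3):
    """Steps (1)-(4): cos(theta) of the keyword term-frequency vectors."""
    if not tokens_a or not tokens_b:
        raise ValueError("both documents must contain at least one token")
    corpus = [set(tokens_a), set(tokens_b)]
    # Step (2): merge the keywords of both documents, dropping duplicates.
    merged = list(dict.fromkeys(top_keywords(tokens_a, corpus, k)
                                + top_keywords(tokens_b, corpus, k)))
    # Step (3): relative term frequencies, so document length does not matter.
    vec_a = [tokens_a.count(t) / len(tokens_a) for t in merged]
    vec_b = [tokens_b.count(t) / len(tokens_b) for t in merged]
    # Step (4): cosine of the two vectors.
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)
```

Because cosine depends only on direction, dividing each vector by the document length changes the individual entries but not the resulting similarity. Note that tf-idf is used only inside `top_keywords`, matching the remark that it appears only in step (1).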

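Worked example:

Article A: 我喜欢看小说。 (I like reading novels.)

Article B: 我不喜欢看电视，也不喜欢看电影。 (I don't like watching TV, and I don't like watching movies either.)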

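    Step 1: segment the text into words.

        Article A: 我/喜欢/看/小说。

        Article B: 我/不/喜欢/看/电视，也/不/喜欢/看/电影。

    Step 2: list all the words.

        我 (I), 喜欢 (like), 看 (read/watch), 小说 (novel), 电视 (TV), 电影 (movie), 不 (not), 也 (also).

    Step 3: compute the term frequency tf of each word in each document.

        Article A: 我 1, 喜欢 1, 看 1, 小说 1, 电视 0, 电影 0, 不 0, 也 0.

        Article B: 我 1, 喜欢 2, 看 2, 小说 0, 电视 1, 电影 1, 不 2, 也 1.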
  
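    Step 4: compute the inverse document frequency idf of each word (base-2 logarithm of the number of documents, 2, divided by the number of documents that contain the word).

        我 log(2/2)=0, 喜欢 log(2/2)=0, 看 log(2/2)=0, 小说 log(2/1)=1, 电视 log(2/1)=1, 电影 log(2/1)=1, 不 log(2/1)=1, 也 log(2/1)=1.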
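
Steps 3 and 4 can be reproduced with a short script. It assumes the segmentation from step 1 is given as token lists (no segmenter is invoked here) and uses a base-2 logarithm, which is what the values log(2/1)=1 imply. The variable names are illustrative.

```python
import math
from collections import Counter

doc_a = ["我", "喜欢", "看", "小说"]
doc_b = ["我", "不", "喜欢", "看", "电视", "也", "不", "喜欢", "看", "电影"]
docs = [doc_a, doc_b]
vocab = ["我", "喜欢", "看", "小说", "电视", "电影", "不", "也"]

# Step 3: raw term frequency of every vocabulary word in each document.
tf = [[Counter(doc)[t] for t in vocab] for doc in docs]

# Step 4: idf = log2(number of documents / documents containing the word).
idf = [math.log2(len(docs) / sum(t in doc for doc in docs)) for t in vocab]

print(tf[0])  # [1, 1, 1, 1, 0, 0, 0, 0]
print(tf[1])  # [1, 2, 2, 0, 1, 1, 2, 1]
print(idf)    # [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```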
  
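    Step 5: compute the tf-idf value of each word in each document.

        Article A: 我 0, 喜欢 0, 看 0, 小说 1, 电视 0, 电影 0, 不 0, 也 0.

        Article B: 我 0, 喜欢 0, 看 0, 小说 0, 电视 1, 电影 1, 不 2, 也 1.

    Step 6: select each article's keywords (here the 3 words with the highest tf-idf; ties are broken at random).

        Article A: 我 0, 喜欢 0, 小说 1

        Article B: 电视 1, 电影 1, 不 2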
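
Steps 5 and 6 amount to multiplying tf by idf and keeping the three highest-scoring words. The sketch below starts from the tf and idf values of steps 3 and 4. Since 不 occurs twice in article B and has idf 1, its tf-idf there is 2 × 1 = 2. One assumption to flag: the text breaks ties at random, whereas this sketch breaks them by vocabulary order (Python's sort is stable), which happens to select the same words as the example. The helper name `top_keywords` is hypothetical.

```python
vocab = ["我", "喜欢", "看", "小说", "电视", "电影", "不", "也"]
tf_a = [1, 1, 1, 1, 0, 0, 0, 0]
tf_b = [1, 2, 2, 0, 1, 1, 2, 1]
idf = [0, 0, 0, 1, 1, 1, 1, 1]

def top_keywords(tf, k=3):
    """Return (tf-idf scores, the k highest-scoring words)."""
    # Step 5: tf-idf is the product of tf and idf for each word.
    scores = {t: f * w for t, f, w in zip(vocab, tf, idf)}
    # Step 6: rank by score; ties stay in vocabulary order (stable sort).
    ranked = sorted(scores, key=scores.get, reverse=True)
    return scores, ranked[:k]

scores_a, keys_a = top_keywords(tf_a)
scores_b, keys_b = top_keywords(tf_b)
print(keys_a)  # ['小说', '我', '喜欢']
print(keys_b)  # ['不', '电视', '电影']
```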
  
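    Step 7: build the term-frequency vectors used to compute similarity (over the words selected in the previous step: 我, 喜欢, 小说, 电视, 电影, 不).

        Article A: [1 1 1 0 0 0]

        Article B: [1 2 0 1 1 2]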
  
    Step 8: compute the cosine similarity.

        cosθ = (1*1 + 1*2) / (sqrt(3)*sqrt(11)) = 3/sqrt(33) = 0.5222329678670935

        cosineSIM(A, B) = 0.5222329678670935*0.5 + 0.5 = 0.7611164839335467
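
As a quick check of steps 7 and 8, the script below rebuilds the two term-frequency vectors from the segmented articles and the six selected words, then evaluates cosθ and the normalized score. Variable names are illustrative; raw counts are used, as in the example.

```python
import math
from collections import Counter

doc_a = ["我", "喜欢", "看", "小说"]
doc_b = ["我", "不", "喜欢", "看", "电视", "也", "不", "喜欢", "看", "电影"]
selected = ["我", "喜欢", "小说", "电视", "电影", "不"]

# Step 7: one count per selected word, in the same order for both articles.
vec_a = [Counter(doc_a)[t] for t in selected]  # [1, 1, 1, 0, 0, 0]
vec_b = [Counter(doc_b)[t] for t in selected]  # [1, 2, 0, 1, 1, 2]

# Step 8: cosine of the angle between the vectors, then rescale to [0, 1].
dot = sum(a * b for a, b in zip(vec_a, vec_b))  # 3
norm_a = math.sqrt(sum(a * a for a in vec_a))   # sqrt(3)
norm_b = math.sqrt(sum(b * b for b in vec_b))   # sqrt(11)
cos_theta = dot / (norm_a * norm_b)
cosine_sim = 0.5 * cos_theta + 0.5

print(round(cos_theta, 4), round(cosine_sim, 4))  # 0.5222 0.7611
```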

References:

(1) https://en.wikipedia.org/wiki/Tf%E2%80%93idf

(2) http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html

Reposted from: https://www.cnblogs.com/wxiaoli/p/6940702.html
