TF-IDF词项权重计算

一、TF-IDF

词项频率:

df:term frequency。 term在文档中出现的频率.tf越大,词项越重要.

文档频率:

tf:document frequecy。有多少文档包含此term,df越大词项越不重要.

词项权重计算公式:

tf-idf=tf(t,d)*log(N/df(t))
 
 
  • 1
  • 1
  • W(t,d):the weight of the term in document d
  • tf(t,d):the frequency of term t in document d
  • N:the number of documents
  • df(t):the number of documents that contain term t

二、JAVA实现

package com.javacore.algorithm;

import java.util.Arrays;
import java.util.List;

/**
 * Created by bee on 17/3/13.
 * @version 1.0
 * @author blog.csdn.net/napoay
 */
public class TfIdfCal {



    /**
     *calculate the word frequency
     * @param doc word vector of a doc
     * @param term  a word
     * @return the word frequency of a doc
     */
    public double tf(List<String> doc, String term) {

        double termFrequency = 0;
        for (String str : doc) {
            if (str.equalsIgnoreCase(term)) {
                termFrequency++;
            }
        }
        return termFrequency / doc.size();
    }


    /**
     *calculate the document frequency
     * @param docs the set of all docs
     * @param term a word
     * @return the number of docs which contain the word
     */

    public int df(List<List<String>> docs, String term) {
        int n = 0;
        if (term != null && term != "") {

            for (List<String> doc : docs) {
                for (String word : doc) {
                    if (term.equalsIgnoreCase(word)) {
                        n++;
                        break;
                    }
                }
            }
        } else {
            System.out.println("term不能为null或者空串");
        }

        return n;
    }


    /**
     *calculate the inverse document frequency
     * @param docs  the set of all docs
     * @param term  a word
     * @return  idf
     */

    public double idf(List<List<String>> docs, String term) {

        System.out.println("N:"+docs.size());
        System.out.println("DF:"+df(docs,term));
        return  Math.log(docs.size()/(double)df(docs,term));
    }


    /**
     * calculate tf-idf
     * @param doc a doc
     * @param docs document set
     * @param term a word
     * @return inverse document frequency
     */
    public double tfIdf(List<String> doc, List<List<String>> docs, String term) {

        return tf(doc, term) * idf(docs, term);
    }


    public static void main(String[] args) {

        List<String> doc1 = Arrays.asList("人工", "智能", "成为", "互联网", "大会", "焦点");
        List<String> doc2 = Arrays.asList("谷歌", "推出", "开源", "人工", "智能", "系统", "工具");
        List<String> doc3 = Arrays.asList("互联网", "的", "未来", "在", "人工", "智能");
        List<String> doc4 = Arrays.asList("谷歌", "开源", "机器", "学习", "工具");
        List<List<String>> documents = Arrays.asList(doc1, doc2, doc3,doc4);


        TfIdfCal calculator = new TfIdfCal();

        System.out.println(calculator.tf(doc2, "开源"));
        System.out.println(calculator.df(documents, "开源"));
        double tfidf = calculator.tfIdf(doc2, documents, "谷歌");
        System.out.println("TF-IDF (谷歌) = " + tfidf);
        System.out.println(Math.log(4/2)*1.0/7);

    }


}

 
 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46
  • 47
  • 48
  • 49
  • 50
  • 51
  • 52
  • 53
  • 54
  • 55
  • 56
  • 57
  • 58
  • 59
  • 60
  • 61
  • 62
  • 63
  • 64
  • 65
  • 66
  • 67
  • 68
  • 69
  • 70
  • 71
  • 72
  • 73
  • 74
  • 75
  • 76
  • 77
  • 78
  • 79
  • 80
  • 81
  • 82
  • 83
  • 84
  • 85
  • 86
  • 87
  • 88
  • 89
  • 90
  • 91
  • 92
  • 93
  • 94
  • 95
  • 96
  • 97
  • 98
  • 99
  • 100
  • 101
  • 102
  • 103
  • 104
  • 105
  • 106
  • 107
  • 108
  • 109
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40
  • 41
  • 42
  • 43
  • 44
  • 45
  • 46
  • 47
  • 48
  • 49
  • 50
  • 51
  • 52
  • 53
  • 54
  • 55
  • 56
  • 57
  • 58
  • 59
  • 60
  • 61
  • 62
  • 63
  • 64
  • 65
  • 66
  • 67
  • 68
  • 69
  • 70
  • 71
  • 72
  • 73
  • 74
  • 75
  • 76
  • 77
  • 78
  • 79
  • 80
  • 81
  • 82
  • 83
  • 84
  • 85
  • 86
  • 87
  • 88
  • 89
  • 90
  • 91
  • 92
  • 93
  • 94
  • 95
  • 96
  • 97
  • 98
  • 99
  • 100
  • 101
  • 102
  • 103
  • 104
  • 105
  • 106
  • 107
  • 108
  • 109

运行结果:

0.14285714285714285
2
N:4
DF:2
TF-IDF (谷歌) = 0.09902102579427789
参考:http://blog.csdn.net/napoay/article/details/65449877
  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值