Java TF-IDF: Term Weight Calculation

1. TF-IDF

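Term frequency:

tf: term frequency. How often a term occurs in a document. The larger the tf, the more important the term is to that document.

Document frequency:

df: document frequency. The number of documents that contain the term. The larger the df, the less important the term.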

Term weight formula:

W(t,d) = tf(t,d) * log(N / df(t))

W(t,d): the weight of term t in document d

tf(t,d): the frequency of term t in document d

N: the total number of documents

df(t): the number of documents that contain term t
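
To make the formula concrete, here is a minimal sketch that plugs numbers into it. The class name `TfIdfFormulaSketch`, the method `weight`, and the counts are all made up for illustration. The sketch assumes the natural logarithm, which is what `Math.log` computes.

```java
// Hypothetical demo class; the numbers below are invented for illustration.
class TfIdfFormulaSketch {

    /** W(t,d) = tf(t,d) * ln(N / df(t)), using the natural logarithm. */
    static double weight(double tf, int totalDocs, int docFreq) {
        // Cast to double so N / df is not truncated by integer division.
        return tf * Math.log((double) totalDocs / docFreq);
    }

    public static void main(String[] args) {
        // The term occurs 3 times in a 100-word document, so tf = 0.03.
        // 10 of 1000 documents contain it, so idf = ln(100), about 4.605.
        // The weight is therefore roughly 0.138.
        System.out.println(weight(3.0 / 100, 1000, 10));

        // Same tf, but every document contains the term: idf = ln(1) = 0,
        // so the weight drops to 0 no matter how often the term occurs.
        System.out.println(weight(3.0 / 100, 1000, 1000));
    }
}
```

The second call shows why df matters: a term that appears in every document carries no weight, however frequent it is in a single document.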

2. Java Implementation

package com.javacore.algorithm;

import java.util.Arrays;

import java.util.List;

/**

* Created by bee on 17/3/13.

* @version 1.0

* @author blog.csdn.net/napoay

*/

public class TfIdfCal {

    /**
     * Calculate the term frequency.
     *
     * @param doc  word vector of a doc
     * @param term a word
     * @return occurrences of the term divided by the length of the doc
     */
    public double tf(List<String> doc, String term) {
        double termFrequency = 0;
        for (String str : doc) {
            if (str.equalsIgnoreCase(term)) {
                termFrequency++;
            }
        }
        return termFrequency / doc.size();
    }

    /**
     * Calculate the document frequency.
     *
     * @param docs the set of all docs
     * @param term a word
     * @return the number of docs which contain the word
     */
    public int df(List<List<String>> docs, String term) {
        int n = 0;
        // Use isEmpty() here: comparing with != "" only compares references.
        if (term != null && !term.isEmpty()) {
            for (List<String> doc : docs) {
                for (String word : doc) {
                    if (term.equalsIgnoreCase(word)) {
                        n++;
                        break; // count each doc at most once
                    }
                }
            }
        } else {
            System.out.println("term must not be null or an empty string");
        }
        return n;
    }

    /**
     * Calculate the inverse document frequency.
     *
     * @param docs the set of all docs
     * @param term a word
     * @return idf, the natural log of N / df (Infinity if no doc contains the term)
     */
    public double idf(List<List<String>> docs, String term) {
        int docFreq = df(docs, term);
        System.out.println("N:" + docs.size());
        System.out.println("DF:" + docFreq);
        return Math.log(docs.size() / (double) docFreq);
    }

    /**
     * Calculate tf-idf.
     *
     * @param doc  a doc
     * @param docs document set
     * @param term a word
     * @return the tf-idf weight of the term in the doc
     */
    public double tfIdf(List<String> doc, List<List<String>> docs, String term) {
        return tf(doc, term) * idf(docs, term);
    }

    public static void main(String[] args) {
        List<String> doc1 = Arrays.asList("人工", "智能", "成为", "互联网", "大会", "焦点");
        List<String> doc2 = Arrays.asList("谷歌", "推出", "开源", "人工", "智能", "系统", "工具");
        List<String> doc3 = Arrays.asList("互联网", "的", "未来", "在", "人工", "智能");
        List<String> doc4 = Arrays.asList("谷歌", "开源", "机器", "学习", "工具");
        List<List<String>> documents = Arrays.asList(doc1, doc2, doc3, doc4);

        TfIdfCal calculator = new TfIdfCal();
        System.out.println(calculator.tf(doc2, "开源"));
        System.out.println(calculator.df(documents, "开源"));
        double tfidf = calculator.tfIdf(doc2, documents, "谷歌");
        System.out.println("TF-IDF (谷歌) = " + tfidf);
        // Manual check: ln(N / df) * tf = ln(4 / 2) * 1 / 7
        System.out.println(Math.log(4.0 / 2) / 7);
    }

}

Output:

0.14285714285714285

2

N:4

DF:2

TF-IDF (谷歌) = 0.09902102579427789
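
One edge case is worth keeping in mind: if a term occurs in none of the documents, df is 0, `Math.log(N / 0.0)` is `Infinity`, and multiplying it by tf = 0 gives `NaN`. A possible workaround, sketched below, is add-one smoothing of the IDF. This is a swapped-in variant and not part of the code above. The class `SmoothedIdfSketch` and the method `smoothedIdf` are hypothetical names, and the smoothed values differ from the results printed above.

```java
// Hypothetical helper, not part of the original TfIdfCal class.
class SmoothedIdfSketch {

    /**
     * Add-one smoothed IDF: ln((1 + N) / (1 + df)).
     * Stays finite when df == 0, unlike ln(N / df).
     */
    static double smoothedIdf(int totalDocs, int docFreq) {
        return Math.log((1.0 + totalDocs) / (1.0 + docFreq));
    }

    public static void main(String[] args) {
        // Unsmoothed: 4 / 0.0 is Infinity, and 0.0 * Infinity is NaN.
        double unsmoothed = 0.0 * Math.log(4 / 0.0);
        System.out.println(Double.isNaN(unsmoothed)); // true

        // Smoothed: the IDF is finite, so tf = 0 yields a weight of 0.
        System.out.println(0.0 * smoothedIdf(4, 0)); // 0.0
    }
}
```

A simpler alternative, under the same assumptions, would be to return 0 from `tfIdf` whenever df is 0, which leaves every other result unchanged.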
