Computing Term Frequency-Inverse Document Frequency (TF-IDF) for Documents

Latest recommended article published on 2022-05-04 17:38:32

小飞侠-2

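TF-IDF calculation:

TF-IDF reflects how important a word is to one document within a collection of documents, and it is often used as a weighting factor in text data mining and information retrieval. Within a given document, term frequency (TF) is the frequency with which a given term appears in that document. Inverse document frequency (IDF) measures how informative a term is across the whole collection: the more documents a term appears in, the lower its IDF. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents that contain the term, then taking the logarithm of the quotient.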
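To make the definition concrete, here is a minimal sketch that applies it to a tiny, already tokenized corpus. The class name `TfIdfFormulaSketch`, its method names, and the sample documents are all made up for illustration, and returning an IDF of 0 for a term that appears in no document is an assumption rather than part of the definition.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; class and method names are hypothetical.
final class TfIdfFormulaSketch {

    // TF: occurrences of the term divided by the number of tokens in the document.
    static double tf(List<String> docTokens, String term) {
        if (docTokens.isEmpty()) {
            return 0.0;
        }
        long hits = docTokens.stream().filter(t -> t.equalsIgnoreCase(term)).count();
        return (double) hits / docTokens.size();
    }

    // IDF: natural log of (total documents / documents containing the term).
    static double idf(List<List<String>> corpus, String term) {
        long docsWithTerm = corpus.stream()
                .filter(doc -> doc.stream().anyMatch(t -> t.equalsIgnoreCase(term)))
                .count();
        // Assumption: a term that occurs nowhere gets IDF 0 instead of infinity.
        return docsWithTerm == 0 ? 0.0 : Math.log((double) corpus.size() / docsWithTerm);
    }

    static double tfIdf(List<String> docTokens, List<List<String>> corpus, String term) {
        return tf(docTokens, term) * idf(corpus, term);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("the", "cat", "sat"),
                Arrays.asList("the", "dog", "sat"),
                Arrays.asList("the", "cat", "ran", "far"));
        // "the" occurs in all 3 documents: idf = ln(3/3) = 0, so its weight is 0.
        System.out.println(tfIdf(corpus.get(0), corpus, "the"));
        // "dog" occurs in 1 of 3 documents: tf = 1/3, idf = ln(3).
        System.out.println(tfIdf(corpus.get(1), corpus, "dog"));
    }
}
```

A word that appears in every document ends up with weight 0, while a word confined to a single document keeps a positive weight. That is the behaviour the definition is meant to produce.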

Related code:

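 import java.util.List;
 import java.util.regex.Pattern;

 // Token delimiters: whitespace and common punctuation.
 // The trailing "+" treats a run of delimiters as a single split point,
 // so "a, b" does not produce an empty token between "," and " ".
 private static final Pattern r = Pattern.compile("[ \\t\\n{}()\",:;.]+");
 private static List<String> documentCollection;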
   
    //Calculates TF-IDF weight for each term t in document d  
    private static float findTFIDF(String document, String term)  
    {  
        float tf = findTermFrequency(document, term);  
        float idf = findInverseDocumentFrequency(term);  
        return tf * idf;  
    }  
   
    private static float findTermFrequency(String document, String term)  
    {  
        int count = getFrequencyInOneDoc(document, term);

        // Fraction of the document's tokens that match the term
        return (float) count / r.split(document).length;
    }  
      
    private static int getFrequencyInOneDoc(String document, String term)  
    {  
        int count = 0;
        // Case-insensitive comparison against every token in the document
        for (String s : r.split(document))
        {
            if (s.equalsIgnoreCase(term)) {
                count++;
            }
        }
        return count;
    }  
   
   
    private static float findInverseDocumentFrequency(String term)  
    {  
        // Count the documents in the collection that contain the term at least once
        int count = 0;
        for (String doc : documentCollection)
        {
            if (getFrequencyInOneDoc(doc, term) > 0) {
                count++;
            }
        }
        // A term that occurs in no document would cause a division by zero; return 0 for it.
        if (count == 0) {
            return 0;
        }
        /*
         * Log of the ratio of the total number of documents in the collection
         * to the number of documents containing the term.
         * An alternative that avoids the zero check is
         * Math.log(documentCollection.size() / (1.0 + count)).
         */
        return (float) Math.log((float) documentCollection.size() / count);
    }  
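The comment in the IDF function mentions adding 1 to the document count to avoid dividing by zero. The sketch below compares that add-one variant with the plain formula. The class name `SmoothedIdfSketch` and both method names are hypothetical, and the add-one form shown is only one of several smoothing conventions in use.

```java
// Illustrative sketch only; class and method names are hypothetical.
final class SmoothedIdfSketch {

    // Plain IDF: ln(N / df). Undefined when df == 0.
    static double plainIdf(int totalDocs, int docsWithTerm) {
        if (docsWithTerm == 0) {
            throw new IllegalArgumentException("term occurs in no document");
        }
        return Math.log((double) totalDocs / docsWithTerm);
    }

    // Add-one variant: ln(N / (1 + df)). Defined for df == 0,
    // but slightly negative when the term occurs in every document.
    static double addOneIdf(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / (1 + docsWithTerm));
    }

    public static void main(String[] args) {
        int n = 4;
        for (int df = 0; df <= n; df++) {
            String plain = df == 0 ? "undefined" : String.valueOf(plainIdf(n, df));
            System.out.println("df=" + df + " plain=" + plain + " addOne=" + addOneIdf(n, df));
        }
    }
}
```

The trade-off is that the add-one form never fails, but it turns negative when a term appears in every document. If negative weights would be a problem downstream, an explicit zero check may be the simpler choice.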

Build a Vector Space Model of the documents and compute the cosine similarity.

Related code:

 public static float findCosineSimilarity(float[] vecA, float[] vecB)  
 {  
     float dotProduct = dotProduct(vecA, vecB);  
     float magnitudeOfA = magnitude(vecA);  
     float magnitudeOfB = magnitude(vecB);  
     float result = dotProduct / (magnitudeOfA * magnitudeOfB);  
     // 0 divided by 0 yields NaN (at least one vector is all zeros), so return 0 in that case.
     return Float.isNaN(result) ? 0 : result;
 }  
   
 public static float dotProduct(float[] vecA, float[] vecB)  
 {  
   
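     // Both vectors must be indexed by the same vocabulary
     if (vecA.length != vecB.length)
         throw new IllegalArgumentException("vectors must have the same length");
     float dotProduct = 0;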
     for (int i = 0; i < vecA.length; i++)  
     {  
         dotProduct += (vecA[i] * vecB[i]);  
     }  
   
     return dotProduct;  
 }  
   
 // Magnitude of the vector is the square root of the dot product of the vector with itself.  
 public static float magnitude(float[] vector)  
 {  
     return (float)Math.sqrt(dotProduct(vector, vector));  
 }  
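The cosine functions above take ready-made vectors, but the post does not show how a document becomes a vector. One possible construction, sketched below, gives each document one TF-IDF weight per vocabulary term. The class `VectorSpaceSketch`, its methods, the sorted-vocabulary ordering, and the sample corpus are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustrative sketch only; class and method names are hypothetical.
final class VectorSpaceSketch {

    // Sorted vocabulary: every distinct lower-cased token in the corpus.
    static List<String> vocabulary(List<List<String>> corpus) {
        TreeSet<String> vocab = new TreeSet<>();
        for (List<String> doc : corpus) {
            for (String token : doc) {
                vocab.add(token.toLowerCase());
            }
        }
        return new ArrayList<>(vocab);
    }

    // One TF-IDF weight per vocabulary term, in vocabulary order.
    static double[] toVector(List<String> doc, List<List<String>> corpus, List<String> vocab) {
        double[] vec = new double[vocab.size()];
        for (int i = 0; i < vec.length; i++) {
            String term = vocab.get(i);
            long hits = doc.stream().filter(t -> t.equalsIgnoreCase(term)).count();
            long df = corpus.stream()
                    .filter(d -> d.stream().anyMatch(t -> t.equalsIgnoreCase(term)))
                    .count();
            double tf = doc.isEmpty() ? 0.0 : (double) hits / doc.size();
            double idf = df == 0 ? 0.0 : Math.log((double) corpus.size() / df);
            vec[i] = tf * idf;
        }
        return vec;
    }

    // Cosine similarity; returns 0 when either vector is all zeros.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        double denom = Math.sqrt(normA) * Math.sqrt(normB);
        return denom == 0 ? 0.0 : dot / denom;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("apple", "banana"),
                Arrays.asList("apple", "banana"),
                Arrays.asList("cherry", "date"));
        List<String> vocab = vocabulary(corpus);
        double[] v0 = toVector(corpus.get(0), corpus, vocab);
        double[] v1 = toVector(corpus.get(1), corpus, vocab);
        double[] v2 = toVector(corpus.get(2), corpus, vocab);
        System.out.println("identical docs: " + cosine(v0, v1));
        System.out.println("no shared terms: " + cosine(v0, v2));
    }
}
```

Because every vector is indexed by the same vocabulary, position i means the same term in every document, which is what makes the dot product meaningful.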

Notes:

Stop-word filtering (stop-words filter): drop very common words such as "the" or "of" before computing TF-IDF.
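As a rough illustration of a stop-words filter, the sketch below drops tokens found in a small stop-word set before any weights are computed. The class `StopWordFilterSketch`, the method `removeStopWords`, and the short English word list are hypothetical; a real system would normally load a much fuller list suited to its language.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; class and method names are hypothetical.
final class StopWordFilterSketch {

    // Assumption: a tiny hand-picked English list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and", "is", "in", "to"));

    // Keeps tokens that are non-empty and not in the stop-word set (case-insensitive).
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token.toLowerCase())) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("The", "weight", "of", "a", "term", "in", "the", "document");
        System.out.println(removeStopWords(tokens)); // prints [weight, term, document]
    }
}
```

Filtering before counting also changes the TF denominator, since stop words no longer count toward the document length.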
