Understanding and Using TF-IDF in Spark

1. What is TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency.

It is a weighting technique commonly used in information retrieval and text mining. TF-IDF is a statistical method for evaluating how important a word is to one document in a document collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but decreases in proportion to how frequently it appears across the corpus.

To summarize the quotation above: the more often a word appears in one article, and the fewer documents it appears in overall, the better it represents that article.

That is the essence of TF-IDF.

Term frequency (TF) is the number of times a given word appears in the document. This count is usually normalized (typically by dividing by the total number of words in the document) so that it is not biased toward long documents. (The same word is likely to have a higher raw count in a long document than in a short one, regardless of how important the word actually is.)

Note, however, that common, generic words contribute little to the topic; it is often the words that occur less frequently that express what an article is about, so TF alone is not a suitable measure. The weighting must satisfy the following: the stronger a word's ability to predict the topic, the larger its weight, and vice versa. If, across all the documents, a word appears in only a few of them, that word says a lot about the topic of those documents and should be given a large weight. This is exactly what IDF does.

Formula:

TF_w = (number of times term w appears in a given class) / (total number of terms in that class)

Inverse document frequency (IDF): the main idea is that the fewer documents contain term t, the larger the IDF, meaning the term has good power to discriminate between classes. The IDF of a particular term is obtained by dividing the total number of documents by the number of documents containing that term, and then taking the logarithm of the quotient.

Formula:

IDF_w = log( (total number of documents in the corpus) / (number of documents containing term w + 1) )

The +1 in the denominator avoids dividing by zero.

A high term frequency within a particular document, combined with a low document frequency for that term across the whole collection, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep the important ones.
  

TF-IDF = TF × IDF

2. An Example

Reference: http://www.ruanyifeng.com/blog/2013/03/tf-idf.html

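The article above walks through a concrete calculation. A minimal sketch of the same arithmetic with made-up numbers (a 1,000-word document in which the term appears 20 times, in a corpus of 10,000 documents, 999 of which contain the term):

```python
import math

# Hypothetical numbers, for illustration only.
word_count_in_doc = 20        # times the term appears in the document
total_words_in_doc = 1000     # total words in the document
total_docs = 10000            # documents in the corpus
docs_containing_word = 999    # documents that contain the term

tf = word_count_in_doc / total_words_in_doc                 # 0.02
idf = math.log10(total_docs / (docs_containing_word + 1))   # log10(10) = 1.0
tf_idf = tf * idf                                           # 0.02

print(tf, idf, tf_idf)
```

The base of the logarithm only rescales the weights; base 10 is used here.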

3. Spark documentation reference

(http://spark.apache.org/docs/latest/ml-features.html)

TF-IDF   

Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. Denote a term by t, a document by d, and the corpus by D. Term frequency TF(t,d) is the number of times that term t appears in document d, while document frequency DF(t,D) is the number of documents that contain term t. If we only use term frequency to measure the importance, it is very easy to over-emphasize terms that appear very often but carry little information about the document, e.g. “a”, “the”, and “of”. If a term appears very often across the corpus, it means it doesn’t carry special information about a particular document. Inverse document frequency is a numerical measure of how much information a term provides:

IDF(t,D) = log( (|D| + 1) / (DF(t,D) + 1) ),
where |D| is the total number of documents in the corpus. Since logarithm is used, if a term appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to avoid dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF:
TFIDF(t,d,D)=TF(t,d)⋅IDF(t,D).
There are several variants on the definition of term frequency and document frequency. In MLlib, we separate TF and IDF to make them flexible.
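A quick numerical check of the smoothing behaviour described above, using a hypothetical corpus of 5 documents: a term that appears in every document gets an IDF of 0, while a rarer term gets a positive weight.

```python
import math

total_docs = 5  # hypothetical corpus size

def idf(doc_freq, num_docs=total_docs):
    # Smoothed formula from above: log((|D| + 1) / (DF(t, D) + 1))
    return math.log((num_docs + 1) / (doc_freq + 1))

print(idf(5))  # term in every document -> 0.0
print(idf(1))  # term in one document   -> log(3) ≈ 1.10
print(idf(0))  # term not in the corpus -> log(6) ≈ 1.79 (no division by zero)
```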

TF: Both HashingTF and CountVectorizer can be used to generate the term frequency vectors.

HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a “set of terms” might be a bag of words. HashingTF utilizes the hashing trick. A raw feature is mapped into an index (term) by applying a hash function. The hash function used here is MurmurHash 3. Then term frequencies are calculated based on the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing. To reduce the chance of collision, we can increase the target feature dimension, i.e. the number of buckets of the hash table. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the feature dimension, otherwise the features will not be mapped evenly to the vector indices. The default feature dimension is 2^18 = 262,144. An optional binary toggle parameter controls term frequency counts. When set to true all nonzero frequency counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.
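A minimal sketch of HashingTF in PySpark (assuming an active SparkSession named spark; the data and numFeatures value are only illustrative):

```python
from pyspark.ml.feature import HashingTF

df = spark.createDataFrame([(["a", "b", "a", "c"],)], ["words"])

# numFeatures is kept a power of two so the modulo maps terms evenly.
hashing_tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=16)
hashing_tf.transform(df).show(truncate=False)

# With binary=True every nonzero count is clamped to 1.
binary_tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=16, binary=True)
binary_tf.transform(df).show(truncate=False)
```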

CountVectorizer converts text documents to vectors of term counts. Refer to CountVectorizer for more details.
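For contrast, a minimal CountVectorizer sketch (same assumptions as above); unlike HashingTF it builds a vocabulary, so feature indices can be mapped back to terms:

```python
from pyspark.ml.feature import CountVectorizer

df = spark.createDataFrame([(["a", "b", "a"],), (["b", "c"],)], ["words"])

cv = CountVectorizer(inputCol="words", outputCol="tf", vocabSize=10)
cv_model = cv.fit(df)

print(cv_model.vocabulary)                  # index -> term mapping
cv_model.transform(df).show(truncate=False)
```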

IDF: IDF is an Estimator which is fit on a dataset and produces an IDFModel. The IDFModel takes feature vectors (generally created from HashingTF or CountVectorizer) and scales each feature. Intuitively, it down-weights features which appear frequently in a corpus.

Examples

In the following code segment, we start with a set of sentences. We split each sentence into words using Tokenizer. For each sentence (bag of words), we use HashingTF to hash the sentence into a feature vector. We use IDF to rescale the feature vectors; this generally improves performance when using text as features. Our feature vectors could then be passed to a learning algorithm.
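The code segment itself is not reproduced in this post; a PySpark sketch of the same steps (Tokenizer, then HashingTF, then IDF), assuming an active SparkSession named spark and a few toy sentences:

```python
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

# Toy sentences (hypothetical data).
sentenceData = spark.createDataFrame([
    (0.0, "Hi I heard about Spark"),
    (0.0, "I wish Java could use case classes"),
    (1.0, "Logistic regression models are neat")
], ["label", "sentence"])

# Split each sentence into words.
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)

# Hash each bag of words into a fixed-length term-frequency vector.
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)

# Fit IDF on the corpus and rescale the term-frequency vectors.
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

rescaledData.select("label", "features").show(truncate=False)
```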


  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
A further example: computing TF-IDF for Chinese text with Spark. The steps are roughly:

1. Segment the Chinese text into words, e.g. with a segmentation tool such as jieba.
2. Clean the segmented text by removing stop words, punctuation, and other irrelevant tokens.
3. Compute each word's term frequency (TF) and inverse document frequency (IDF) over the documents.
4. Multiply TF by IDF to get each word's TF-IDF weight, sort in descending order, and take the top few words as keywords.

Below is an example of computing TF-IDF for Chinese text with Spark:

```python
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StopWordsRemover

# Load the data. Tokenizer splits on whitespace, so the text in data.txt is
# assumed to be pre-segmented (e.g. with jieba) and space-separated.
data = spark.read.text("data.txt")

# Tokenize
tokenizer = Tokenizer(inputCol="value", outputCol="words")
wordsData = tokenizer.transform(data)

# Remove stop words. Spark's bundled stop-word lists do not include Chinese,
# so a custom list is supplied here (extend it with a full stop-word list).
chinese_stopwords = ["的", "了", "是", "在", "和"]
stopwordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered",
                                    stopWords=chinese_stopwords)
filteredData = stopwordsRemover.transform(wordsData)

# Compute TF
hashingTF = HashingTF(inputCol="filtered", outputCol="rawFeatures", numFeatures=10000)
featurizedData = hashingTF.transform(filteredData)

# Compute IDF
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

# Take the n feature indices with the largest TF-IDF weights. HashingTF keeps no
# index-to-word mapping, so use CountVectorizer instead if the actual keyword
# strings are needed.
n = 10
keywords = (rescaledData.select("features").rdd
            .flatMap(lambda row: zip(row.features.indices, row.features.values))
            .sortBy(lambda x: x[1], ascending=False)
            .take(n))

# Print the (feature index, TF-IDF weight) pairs
print(keywords)
```
