TF-IDF: What It Means
TF-IDF (term frequency–inverse document frequency) is a weighting technique widely used in information retrieval and data mining.
- Core idea
If a term or phrase appears frequently (high TF) in one article but rarely in other articles, it is considered highly discriminative between categories and therefore well suited for classification.
- TF
Term frequency (TF) is the frequency with which a given term appears in a document.
- IDF
Inverse document frequency (IDF) measures a term's general importance. The IDF of a term is obtained by dividing the total number of documents by the number of documents containing that term, then taking the base-10 logarithm of the quotient: idf(t) = log10(N / df(t)).
The fewer documents a term appears in, the larger its IDF.
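A small Python sketch to make the definition concrete. Note that the textbook formula above uses log base 10, while Spark ML's `IDF` (used in the program below) applies a smoothed natural-log variant, ln((N + 1) / (df + 1)), which is why the values later in this article are 0.6931 (= ln 2) and 0.2877 (= ln 4/3):

```python
import math

# Textbook IDF: log10(N / df), where N is the total number of documents
# and df is the number of documents containing the term.
def idf_log10(n_docs, df):
    return math.log10(n_docs / df)

# Spark ML's IDF uses a smoothed natural log: ln((N + 1) / (df + 1)).
def idf_spark(n_docs, df):
    return math.log((n_docs + 1) / (df + 1))

# With 3 documents: a term in only 1 document gets a large IDF,
# a term in all 3 documents gets IDF = 0.
print(idf_log10(3, 1))  # rare term -> large IDF
print(idf_log10(3, 3))  # ubiquitous term -> 0.0
print(idf_spark(3, 1))  # ln 2 ≈ 0.6931
print(idf_spark(3, 3))  # 0.0
```

Under both variants, a term appearing in every document contributes nothing (IDF = 0), so terms like a stop word are automatically down-weighted.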
TF-IDF extraction workflow:
1. Tokenize the text
2. Compute the term frequency (TF) for each document
3. Compute the IDF (inverse document frequency) values
4. Compute the TF-IDF values
5. Look up the IDF and TF-IDF values for the terms of interest
6. Further optimization: use TextRank to take the article's contextual semantics into account
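The steps above (minus segmentation, which needs a Chinese tokenizer such as IK) can be sketched in plain Python over already-segmented documents; this mirrors what the Spark program below does with CountVectorizer + IDF, using Spark's smoothed ln((N + 1) / (df + 1)) formula:

```python
import math
from collections import Counter

# Step 1 assumed done: the three demo sentences, already segmented.
docs = [
    ["小林", "爱", "爸爸", "电影"],
    ["爱", "中国", "喜欢", "电影"],
    ["爱看", "普罗米修斯", "电影", "精彩"],
]

n = len(docs)
# Step 2: term frequency per document.
tfs = [Counter(doc) for doc in docs]
# Step 3: document frequency, then smoothed IDF (ln((N + 1) / (df + 1))).
df = Counter(w for doc in docs for w in set(doc))
idf = {w: math.log((n + 1) / (df[w] + 1)) for w in df}
# Step 4: TF-IDF per document.
tfidf = [{w: tf[w] * idf[w] for w in tf} for tf in tfs]

# Step 5: terms of document 0, highest TF-IDF first.
for w, v in sorted(tfidf[0].items(), key=lambda kv: -kv[1]):
    print(w, round(v, 4))
```

"电影" occurs in all three documents and gets TF-IDF 0, while the document-specific terms ("小林", "爸爸", "普罗米修斯", ...) score highest — the same ranking the Spark program produces at the end of this article.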
package com.xiaolin.ML.start_study
import java.util
import com.xiaolin.RecommenderProgram.util.IKAnalyzer
import org.apache.lucene.analysis.TokenStream
import org.apache.spark.ml.feature.{CountVectorizer, HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.linalg.SparseVector
import org.apache.spark.sql.types.{ArrayType, StringType}
import org.apache.spark.sql.{SparkSession, functions}
import scala.collection.mutable.{ArrayBuffer, ListBuffer}
object TFIDFApp {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("TF-IDF Demo").master("local").getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("ERROR")
// Initialize demo data
val sentenceData = spark.createDataFrame(Seq(
(0.0, "小林爱爸爸,电影"),
(1.0, "我爱中国,非常非常的喜欢电影"),
(2.0, "我爱看普罗米修斯电影,非常的精彩")
)).toDF("label","text")
sentenceData.show(false)
+-----+-------------------------------+
|label|text |
+-----+-------------------------------+
|0.0 |小林爱爸爸,电影 |
|1.0 |我爱中国,非常非常的喜欢电影 |
|2.0 |我爱看普罗米修斯电影,非常的精彩|
+-----+-------------------------------+
// Segment the Chinese text with the IK analyzer (JavaConversions bridges its java.util.List result to a Scala Seq)
import scala.collection.JavaConversions._
val wordsData = sentenceData.map(x=>{
val analyzer = new IKAnalyzer()
val label = x.getAs[Double]("label")
val text = x.getAs[String]("text")
val strings = analyzer.segmentation(text)
(label,text,strings.toSeq.toList)
}).toDF("label","text","words")
wordsData.show(false)
+-----+-------------------------------+------------------------------+
|label|text |words |
+-----+-------------------------------+------------------------------+
|0.0 |小林爱爸爸,电影 |[小林, 爱, 爸爸, 电影] |
|1.0 |我爱中国,非常非常的喜欢电影 |[爱, 中国, 喜欢, 电影] |
|2.0 |我爱看普罗米修斯电影,非常的精彩|[爱看, 普罗米修斯, 电影, 精彩]|
+-----+-------------------------------+------------------------------+
// Tokenizer splits English sentences into words (whitespace-based, so not suitable for Chinese;
// the input column name would also need to match the DataFrame above)
// val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
// val wordsData = tokenizer.transform(sentenceData)
// HashingTF hashes each term to an Int bucket and counts its term frequency (TF); the mapping back to words is not recoverable
// setNumFeatures(200) sets the number of hash buckets to 200. Tune it to your vocabulary size: the larger the value,
// the lower the probability that two different terms collide into the same bucket (more accurate), at the cost of more memory
// val hashingTF = new HashingTF()
// .setInputCol("words")
// .setOutputCol("TF Features")
// .setNumFeatures(200)
//
// val featurizedData = hashingTF.transform(wordsData)
// CountVectorizer counts document term frequencies with a recoverable vocabulary, so each index maps back to its word (and its IDF)
val cvModel = new CountVectorizer()
.setInputCol("words")
.setOutputCol("TF Features")
.fit(wordsData)
val featurizedData = cvModel.transform(wordsData)
featurizedData.show(false)
+-----+-------------------------------+------------------------------+-------------------------------+
|label|text |words |TF Features |
+-----+-------------------------------+------------------------------+-------------------------------+
|0.0 |小林爱爸爸,电影 |[小林, 爱, 爸爸, 电影] |(9,[0,1,2,8],[1.0,1.0,1.0,1.0])|
|1.0 |我爱中国,非常非常的喜欢电影 |[爱, 中国, 喜欢, 电影] |(9,[0,1,5,6],[1.0,1.0,1.0,1.0])|
|2.0 |我爱看普罗米修斯电影,非常的精彩|[爱看, 普罗米修斯, 电影, 精彩]|(9,[0,3,4,7],[1.0,1.0,1.0,1.0])|
+-----+-------------------------------+------------------------------+-------------------------------+
// Fit the IDF model
val idf = new IDF().setInputCol("TF Features").setOutputCol("TF-IDF features")
val idfModel = idf.fit(featurizedData)
// Save the model, then read the saved model data back:
// idfModel.write.overwrite().save("file:///D:\\IDEAspaces\\bigdata_study\\bigdata_spark\\data\\model\\idf.model")
spark.read
.parquet("file:///D:\\IDEAspaces\\bigdata_study\\bigdata_spark\\data\\model\\idf.model\\data")
.show(false)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|idf |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[0.0,0.28768207245178085,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453,0.6931471805599453]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
// Compute TF-IDF
val rescaledData = idfModel.transform(featurizedData)
rescaledData.show(false)
+-----+-------------------------------+------------------------------+-------------------------------+-----------------------------------------------------------------------------+
|label|text |words |TF Features |TF-IDF features |
+-----+-------------------------------+------------------------------+-------------------------------+-----------------------------------------------------------------------------+
|0.0 |小林爱爸爸,电影 |[小林, 爱, 爸爸, 电影] |(9,[0,1,2,8],[1.0,1.0,1.0,1.0])|(9,[0,1,2,8],[0.0,0.28768207245178085,0.6931471805599453,0.6931471805599453])|
|1.0 |我爱中国,非常非常的喜欢电影 |[爱, 中国, 喜欢, 电影] |(9,[0,1,5,6],[1.0,1.0,1.0,1.0])|(9,[0,1,5,6],[0.0,0.28768207245178085,0.6931471805599453,0.6931471805599453])|
|2.0 |我爱看普罗米修斯电影,非常的精彩|[爱看, 普罗米修斯, 电影, 精彩]|(9,[0,3,4,7],[1.0,1.0,1.0,1.0])|(9,[0,3,4,7],[0.0,0.6931471805599453,0.6931471805599453,0.6931471805599453]) |
+-----+-------------------------------+------------------------------+-------------------------------+-----------------------------------------------------------------------------+
val words2 = cvModel.vocabulary
// Look up the TF-IDF value and the corresponding word for each vector index
rescaledData.rdd.mapPartitions(partition=>{
val rest = new ListBuffer[(Double, Int, Double,String)]
val topN = 10
while (partition.hasNext) {
val row = partition.next()
val idfVals = row.getAs[SparseVector]("TF-IDF features")
val tmpList = new ListBuffer[(Int, Double)]
idfVals.indices.foreach(i => tmpList += ((i, idfVals(i))))
val buffer = tmpList.sortBy(_._2).reverse
for (item <- buffer.take(topN))
rest += ((row.getAs[Double]("label"), item._1, item._2,words2(item._1)))
}
rest.iterator
}).toDF("item_id", "index", "tfidf","word").show(false)
+-------+-----+-------------------+----------+
|item_id|index|tfidf              |word      |
+-------+-----+-------------------+----------+
|0.0 |8 |0.6931471805599453 |爸爸 |
|0.0 |2 |0.6931471805599453 |小林 |
|0.0 |1 |0.28768207245178085|爱 |
|0.0 |0 |0.0 |电影 |
|1.0 |6 |0.6931471805599453 |中国 |
|1.0 |5 |0.6931471805599453 |喜欢 |
|1.0 |1 |0.28768207245178085|爱 |
|1.0 |0 |0.0 |电影 |
|2.0 |7 |0.6931471805599453 |爱看 |
|2.0 |4 |0.6931471805599453 |精彩 |
|2.0 |3 |0.6931471805599453 |普罗米修斯|
|2.0 |0 |0.0 |电影 |
+-------+-----+-------------------+----------+
spark.close()
}
}