Spark Machine Learning: How TF-IDF Is Implemented

Javis486

Article link: https://blog.csdn.net/jiangpeng59/article/details/52728062

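First, a brief introduction to TF-IDF (term frequency-inverse document frequency): it reflects how important a word is to one document in a corpus. Let t denote a term and d a document. The term frequency TF(t,d) is the number of times term t appears in document d, and the document frequency DF(t,D) is the number of documents in corpus D that contain term t. Terms that are common everywhere, such as "the", "a", "of", and "that", have a high term frequency but say little about any particular document, so we use the inverse document frequency as a numerical measure of how much information a term provides: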

IDF(t,D)=log((|D|+1)/(DF(t,D)+1))

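Here |D| is the total number of documents in the corpus. Adding 1 to both the numerator and the denominator is a smoothing step that keeps the denominator from being 0 for a term that appears in no document. TF-IDF is then simply the product of TF and IDF: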

TFIDF(t,d,D)=TF(t,d)*IDF(t,D)
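To make the two formulas concrete, here is a minimal sketch in plain Scala (no Spark) that evaluates them directly on a toy corpus. The object name, the helper names, and the three sample documents are all made up for illustration; the logarithm is assumed to be the natural logarithm, as in the Spark code examined later.

```scala
object TfIdfFormulaSketch {
  // Hypothetical toy corpus: each document is an already-split word sequence.
  val corpus: Seq[Seq[String]] = Seq(
    Seq("spark", "is", "fast"),
    Seq("spark", "spark", "mllib"),
    Seq("hadoop", "is", "slow")
  )

  // TF(t, d): number of times term t occurs in document d.
  def tf(term: String, doc: Seq[String]): Double =
    doc.count(_ == term).toDouble

  // DF(t, D): number of documents in D that contain term t.
  def df(term: String, docs: Seq[Seq[String]]): Int =
    docs.count(_.contains(term))

  // IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)), natural logarithm.
  def idf(term: String, docs: Seq[Seq[String]]): Double =
    math.log((docs.size + 1.0) / (df(term, docs) + 1.0))

  // TFIDF(t, d, D) = TF(t, d) * IDF(t, D).
  def tfIdf(term: String, doc: Seq[String], docs: Seq[Seq[String]]): Double =
    tf(term, doc) * idf(term, docs)
}
```

In this toy corpus "spark" occurs in 2 of the 3 documents, so its IDF is log(4/3), and its TF-IDF in the second document (where it occurs twice) is 2 * log(4/3). A term that occurs in every document gets an IDF of log(1) = 0, which is how words like "the" are filtered out.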

The following uses an example from the official Spark documentation. All classes come from the mllib package; for details see: http://spark.apache.org/docs/1.6.1/mllib-feature-extraction.html

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.feature.IDF

object TF_IDF_Test {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TfIdfTest")
    val sc = new SparkContext(conf)
    // Load documents (one per line) and split each line into words.
    val documents: RDD[Seq[String]] = sc.textFile("...").map(_.split(" ").toSeq)
    // Hash each word to an index and count its occurrences per document.
    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(documents)
    // tf is used twice (to fit the IDF and to compute TF-IDF), so cache it.
    tf.cache()
    val idf = new IDF().fit(tf)
    val tfidf: RDD[Vector] = idf.transform(tf)
  }
}

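A detailed walkthrough of the code follows:

1. Start with the data source documents, which is passed to hashingTF.transform. Each line of the input file must hold the content of one document, so each element of the RDD is the word sequence of one document.

2. Next, look at the source of hashingTF.transform. It calls the HashingTF class's own single-document transform method on every document: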

  /**
   * Transforms the input document to term frequency vectors.
   */
  @Since("1.1.0")
  def transform[D <: Iterable[_]](dataset: RDD[D]): RDD[Vector] = {
    dataset.map(this.transform)
  }

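3. The HashingTF class's own transform method. The document parameter is the sequence of words obtained by splitting on spaces, and numFeatures is a member variable of HashingTF that defaults to 2^20, i.e. the number of hash dimensions.

The result is a sparse vector whose indices are the words' hash values modulo numFeatures and whose values are the word counts. Different words that hash to the same index are counted together.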

/**
 * Transforms the input document into a sparse term frequency vector.
 */
@Since("1.1.0")
def transform(document: Iterable[_]): Vector = {
  // Map from a word's hash index to that word's count
  val termFrequencies = mutable.HashMap.empty[Int, Double]
  // Iterate over the words of the document
  document.foreach { term =>
    val i = indexOf(term) // hash index of the word
    // Count the occurrences of the word
    termFrequencies.put(i, termFrequencies.getOrElse(i, 0.0) + 1.0)
  }
  // Convert the result into a sparse vector
  Vectors.sparse(numFeatures, termFrequencies.toSeq)
}
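The hashing trick in this method can be tried without Spark. The following is only a sketch under stated assumptions: the object and method names are invented, java.lang.Math.floorMod stands in for HashingTF's indexOf, and the result is returned as sorted (index, count) pairs instead of a Spark sparse vector.

```scala
import scala.collection.mutable

object HashingTfSketch {
  // Returns (index, count) pairs sorted by index, i.e. the non-zero
  // entries that a sparse term frequency vector would hold.
  def termFrequencies(document: Iterable[Any], numFeatures: Int): Seq[(Int, Double)] = {
    val counts = mutable.HashMap.empty[Int, Double]
    document.foreach { term =>
      // Stand-in for indexOf: hash the term, then fold it into [0, numFeatures).
      val i = java.lang.Math.floorMod(term.##, numFeatures)
      counts.put(i, counts.getOrElse(i, 0.0) + 1.0)
    }
    counts.toSeq.sortBy(_._1)
  }
}
```

For example, HashingTfSketch.termFrequencies(Seq("a", "b", "a"), 1 << 20) yields Seq((97, 2.0), (98, 1.0)), because the strings "a" and "b" hash to 97 and 98. With numFeatures = 1 every word collides at index 0 and the result is Seq((0, 3.0)), which shows why a large numFeatures is used: a smaller table saves memory but merges the counts of unrelated words.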

3.1 The indexOf method. term.## returns the hash value of the object term, and Utils.nonNegativeMod takes that hash value modulo numFeatures, keeping the result non-negative.

  /**
   * Returns the index of the input term.
   */
  @Since("1.1.0")
  def indexOf(term: Any): Int = Utils.nonNegativeMod(term.##, numFeatures)

Utils.nonNegativeMod

 /* Calculates 'x' modulo 'mod', takes to consideration sign of x,
  * i.e. if 'x' is negative, than 'x' % 'mod' is negative too
  * so function return (x % mod) + mod in that case.
  */
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }
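Why the extra correction is needed can be seen with a few concrete numbers. The sketch below copies the nonNegativeMod logic into a standalone object (the object name and the absMod helper are hypothetical) and contrasts it with a tempting but flawed alternative, taking the absolute value before the modulo.

```scala
object NonNegativeModSketch {
  // Same logic as Utils.nonNegativeMod: shift a negative remainder into [0, mod).
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Flawed alternative, shown only for contrast: math.abs(Int.MinValue)
  // overflows and is still Int.MinValue, so the result can be negative.
  def absMod(x: Int, mod: Int): Int = math.abs(x) % mod
}
```

On the JVM the remainder takes the sign of the dividend, so -7 % 5 is -2, while nonNegativeMod(-7, 5) is 3. A hash code can be any Int, including Int.MinValue, for which absMod(Int.MinValue, 5) is still negative and would be an invalid vector index; nonNegativeMod always returns a value in [0, mod) for a positive mod.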

4. val idf = new IDF().fit(tf). Here tf is an RDD[Vector]; see step 3 for the content of each sparse vector.

  @Since("1.1.0")
  def fit(dataset: RDD[Vector]): IDFModel = {
    val idf = dataset.treeAggregate(new IDF.DocumentFrequencyAggregator(
          minDocFreq = minDocFreq))(
      seqOp = (df, v) => df.add(v),
      combOp = (df1, df2) => df1.merge(df2)
    ).idf()
    new IDFModel(idf)
  }

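treeAggregate is similar to aggregate. It takes an IDF.DocumentFrequencyAggregator as the initial value; seqOp is the aggregation within a partition, and combOp is the aggregation across partitions. Now look at DocumentFrequencyAggregator itself. Its member variable df (a dense vector) records how many documents each index (a word's hash index) appears in, and m counts the documents seen so far. The add method folds in one new document and updates df and m. The dense and sparse branches work the same way, so take the dense branch as the example: values(j) > 0.0 means the word at index j occurs in this document, so df(j) += 1. The merge method combines the statistics of different partitions (a plain addition). Finally, idf() applies the IDF formula given at the start of the article to the result of treeAggregate (an index whose document frequency is below minDocFreq keeps an IDF of 0), and the result is wrapped in an IDFModel and returned.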

class DocumentFrequencyAggregator(val minDocFreq: Int) extends Serializable {
    // Total number of documents seen so far in the corpus
    private var m = 0L
    // BDV is an alias for a Breeze dense vector; df(i) is the number of
    // documents in which index i (a word's hash index) appears
    private var df: BDV[Long] = _

    def this() = this(0)

    // Add one new document
    def add(doc: Vector): this.type = {
      if (isEmpty) {
        df = BDV.zeros(doc.size) // initialize all counts to 0
      }
      doc match {
        // Sparse vector: only the stored indices need to be visited
        case SparseVector(size, indices, values) =>
          val nnz = indices.size
          var k = 0
          while (k < nnz) {
            if (values(k) > 0) {
              df(indices(k)) += 1L
            }
            k += 1
          }
        // Dense vector: every index is visited
        case DenseVector(values) =>
          val n = values.size
          var j = 0
          while (j < n) {
            // values(j) > 0.0 means the word at index j occurs in this document
            if (values(j) > 0.0) {
              df(j) += 1L
            }
            j += 1
          }
        case other =>
          throw new UnsupportedOperationException(
            s"Only sparse and dense vectors are supported but got ${other.getClass}.")
      }
      m += 1L // one more document in the corpus
      this
    }

    // Merge another aggregator: simply add up the document totals and the df vectors
    def merge(other: DocumentFrequencyAggregator): this.type = {
      if (!other.isEmpty) {
        m += other.m
        if (df == null) {
          df = other.df.copy
        } else {
          df += other.df
        }
      }
      this
    }

    private def isEmpty: Boolean = m == 0L

    /** Returns the current IDF vector. */
    def idf(): Vector = {
      if (isEmpty) {
        throw new IllegalStateException("Haven't seen any document yet.")
      }
      val n = df.length
      val inv = new Array[Double](n)
      var j = 0
      while (j < n) {
        if (df(j) >= minDocFreq) {
          inv(j) = math.log((m + 1.0) / (df(j) + 1.0))
        }
        j += 1
      }
      Vectors.dense(inv)
    }
}
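The add / merge / idf life cycle can be reproduced in plain Scala to see how per-partition results combine. This is a simplified sketch, not Spark's implementation: DocFreq is a hypothetical immutable stand-in for DocumentFrequencyAggregator, documents are given as Map(index -> count), and ordinary foldLeft calls play the role of treeAggregate's seqOp and combOp.

```scala
object DocFreqSketch {
  // m: documents seen so far; df(i): number of documents containing index i.
  final case class DocFreq(m: Long, df: IndexedSeq[Long]) {
    // Counterpart of seqOp: fold one document into the running statistics.
    def add(doc: Map[Int, Double]): DocFreq = {
      val updated = doc.foldLeft(df) { case (acc, (i, v)) =>
        if (v > 0) acc.updated(i, acc(i) + 1L) else acc
      }
      DocFreq(m + 1L, updated)
    }

    // Counterpart of combOp: add up the statistics of two partitions.
    def merge(other: DocFreq): DocFreq =
      DocFreq(m + other.m, df.zip(other.df).map { case (a, b) => a + b })

    // log((m + 1) / (df + 1)); indices below minDocFreq keep an IDF of 0.
    def idf(minDocFreq: Int): IndexedSeq[Double] =
      df.map(d => if (d >= minDocFreq) math.log((m + 1.0) / (d + 1.0)) else 0.0)
  }

  def empty(numFeatures: Int): DocFreq =
    DocFreq(0L, IndexedSeq.fill(numFeatures)(0L))

  // Each inner Seq simulates one partition of term frequency vectors.
  def fit(partitions: Seq[Seq[Map[Int, Double]]],
          numFeatures: Int,
          minDocFreq: Int): IndexedSeq[Double] =
    partitions
      .map(_.foldLeft(empty(numFeatures))((agg, doc) => agg.add(doc)))
      .foldLeft(empty(numFeatures))((a, b) => a.merge(b))
      .idf(minDocFreq)
}
```

Because add and merge are plain sums, the result does not depend on how the documents are split into partitions. That property is what lets Spark compute the partial aggregates in parallel and combine them in any order.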

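5. val tfidf: RDD[Vector] = idf.transform(tf). Multiplying the idf (an IDFModel) obtained in step 4 by tf gives the final result. The code below first broadcasts idf and then multiplies it element-wise with the tf vectors in each partition: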

def transform(dataset: RDD[Vector]): RDD[Vector] = {
    val bcIdf = dataset.context.broadcast(idf)
    dataset.mapPartitions(iter => iter.map(v => IDFModel.transform(bcIdf.value, v)))
  }

The element-wise multiplication:

def transform(idf: Vector, v: Vector): Vector = {
    val n = v.size
    v match {
      case SparseVector(size, indices, values) =>
        val nnz = indices.size
        val newValues = new Array[Double](nnz)
        var k = 0
        while (k < nnz) {
          newValues(k) = values(k) * idf(indices(k))
          k += 1
        }
        Vectors.sparse(n, indices, newValues)
      case DenseVector(values) =>
        val newValues = new Array[Double](n)
        var j = 0
        while (j < n) {
          newValues(j) = values(j) * idf(j)
          j += 1
        }
        Vectors.dense(newValues)
      case other =>
        throw new UnsupportedOperationException(
          s"Only sparse and dense vectors are supported but got ${other.getClass}.")
    }
}
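The sparse branch is the one that matters for HashingTF output, and it is easy to mirror outside Spark. In this small sketch the object and method names are hypothetical, and the sparse vector is represented only by its parallel indices and values arrays.

```scala
object TfIdfScaleSketch {
  // Multiplies each stored term frequency by the IDF at the same index.
  // The indices are left untouched, so the result is exactly as sparse as the input.
  def scale(idf: IndexedSeq[Double],
            indices: Array[Int],
            values: Array[Double]): Array[Double] =
    indices.zip(values).map { case (i, tf) => tf * idf(i) }
}
```

With idf = IndexedSeq(0.0, math.log(2.0), 1.5), indices = Array(1, 2) and values = Array(2.0, 4.0), the scaled values are 2.0 * log(2) and 6.0. An index whose IDF is 0, as at position 0 here, would turn any term frequency into 0, which is the intended effect for words that appear in every document.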

The final tfidf output is in sparse vector format. For a concrete example, see: Spark-MLib之TFIDF实例讲解