基于Spark MLlib的TF-IDF与余弦定理应用 - 文档相似度

网上已经有很多优秀的TF-IDF和余弦定理介绍文章,这里就不重复了。简单记录一下如何利用这些原理结合Spark MLlib计算文档相似度。

第一步:从文档中提取词

Spark MLlib - Tokenizer.class // 默认根据空格分割提取词

/**
 * A tokenizer that converts the input string to lowercase and then splits it by white spaces.
 *
 * @see [[RegexTokenizer]]
 */
@Since("1.2.0")
class Tokenizer @Since("1.4.0") (@Since("1.4.0") override val uid: String)
  extends UnaryTransformer[String, Seq[String], Tokenizer] with DefaultParamsWritable {

  @Since("1.2.0")
  def this() = this(Identifiable.randomUID("tok"))

  override protected def createTransformFunc: String => Seq[String] = {
    _.toLowerCase.split("\\s")
  }

  override protected def validateInputType(inputType: DataType): Unit = {
    require(inputType == StringType, s"Input type must be string type but got $inputType.")
  }

  override protected def outputDataType: DataType = new ArrayType(StringType, true)

  @Since("1.4.1")
  override def copy(extra: ParamMap): Tokenizer = defaultCopy(extra)
}

 摘取自Tokenizer.class部分源码。 

第二步:计算词频

Spark MLlib - HashingTF.class // 哈希每个词并计算出现次数

 /**
   * Transforms the input document into a sparse term frequency vector.
   */
  @Since("1.1.0")
  def transform(document: Iterable[_]): Vector = {
    val termFrequencies = mutable.HashMap.empty[Int, Double]
    val setTF = if (binary) (i: Int) => 1.0 else (i: Int) => termFrequencies.getOrElse(i, 0.0) + 1.0
    val hashFunc: Any => Int = getHashFunction
    document.foreach { term =>
      val i = Utils.nonNegativeMod(hashFunc(term), numFeatures)
      termFrequencies.put(i, setTF(i))
    }
    Vectors.sparse(numFeatures, termFrequencies.toSeq)
  }

摘取自HashingTF.class部分源码。  

第三步:计算逆文档值(TF-IDF)

Spark MLlib - IDF.class

/** Returns the current IDF vector. */
    def idf(): Vector = {
      if (isEmpty) {
        throw new IllegalStateException("Haven't seen any document yet.")
      }
      val n = df.length
      val inv = new Array[Double](n)
      var j = 0
      while (j < n) {
        /*
         * If the term is not present in the minimum
         * number of documents, set IDF to 0. This
         * will cause multiplication in IDFModel to
         * set TF-IDF to 0.
         *
         * Since arrays are initialized to 0 by default,
         * we just omit changing those entries.
         */
        if (df(j) >= minDocFreq) {
          inv(j) = math.log((m + 1.0) / (df(j) + 1.0))
        }
        j += 1
      }
      Vectors.dense(inv)
    }

 摘取自IDF.class部分源码。  

第四步:根据余弦定理计算文档相似度

scalanlp/breeze - 利用breeze库中的函数计算余弦值

def calTwoDocSim(f1: SparseVector, f2: SparseVector): Double = {
    val breeze1 =new SparseVector(f1.indices,f1.values, f1.size)
    val breeze2 =new SparseVector(f2.indices,f2.values, f2.size)
    val cosineSim = breeze1.dot(breeze2) / (norm(breeze1) * norm(breeze2))
    cosineSim
}

参考文章:

http://www.ruanyifeng.com/blog/2013/03/tf-idf.html

http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html

http://dblab.xmu.edu.cn/blog/1261-2/

 

  • 0
    点赞
  • 2
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值