There are already many excellent articles online introducing TF-IDF and cosine similarity, so I won't repeat them here. This is a brief note on how to combine those ideas with Spark MLlib to compute document similarity.
Step 1: Extract terms from the documents
Spark MLlib - Tokenizer.class // by default, splits on whitespace to extract terms
/**
 * A tokenizer that converts the input string to lowercase and then splits it by white spaces.
 *
 * @see [[RegexTokenizer]]
 */
@Since("1.2.0")
class Tokenizer @Since("1.4.0") (@Since("1.4.0") override val uid: String)
  extends UnaryTransformer[String, Seq[String], Tokenizer] with DefaultParamsWritable {

  @Since("1.2.0")
  def this() = this(Identifiable.randomUID("tok"))

  override protected def createTransformFunc: String => Seq[String] = {
    _.toLowerCase.split("\\s")
  }

  override protected def validateInputType(inputType: DataType): Unit = {
    require(inputType == StringType, s"Input type must be string type but got $inputType.")
  }

  override protected def outputDataType: DataType = new ArrayType(StringType, true)

  @Since("1.4.1")
  override def copy(extra: ParamMap): Tokenizer = defaultCopy(extra)
}
Excerpted from the Tokenizer.class source.
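Tokenizer can be used roughly as follows. This is a minimal sketch assuming a local SparkSession; the column names and sample sentences are made up for illustration:

```scala
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession

// Local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("tokenize").getOrCreate()
import spark.implicits._

val docs = Seq(
  (0, "Spark MLlib supports TF-IDF"),
  (1, "Cosine similarity compares documents")
).toDF("id", "sentence")

// Lowercases each sentence and splits on whitespace, as in the source above.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
tokenizer.transform(docs).select("words").show(truncate = false)
```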
Step 2: Compute term frequencies
Spark MLlib - HashingTF.class // hashes each term and counts its occurrences
/**
 * Transforms the input document into a sparse term frequency vector.
 */
@Since("1.1.0")
def transform(document: Iterable[_]): Vector = {
  val termFrequencies = mutable.HashMap.empty[Int, Double]
  val setTF = if (binary) (i: Int) => 1.0 else (i: Int) => termFrequencies.getOrElse(i, 0.0) + 1.0
  val hashFunc: Any => Int = getHashFunction
  document.foreach { term =>
    val i = Utils.nonNegativeMod(hashFunc(term), numFeatures)
    termFrequencies.put(i, setTF(i))
  }
  Vectors.sparse(numFeatures, termFrequencies.toSeq)
}
Excerpted from the HashingTF.class source.
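A sketch of HashingTF on a pre-tokenized column (again assuming a local SparkSession; the data and numFeatures value are illustrative):

```scala
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("tf").getOrCreate()
import spark.implicits._

// A pre-tokenized "words" column; note the repeated term "spark".
val words = Seq((0, Seq("spark", "mllib", "spark"))).toDF("id", "words")

// Each term is hashed into one of numFeatures buckets, so hash collisions are
// possible; a larger numFeatures reduces the chance of two terms sharing a count.
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(4096)
hashingTF.transform(words).select("rawFeatures").show(truncate = false)
```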
Step 3: Compute the inverse document frequency (IDF) to get TF-IDF
Spark MLlib - IDF.class
/** Returns the current IDF vector. */
def idf(): Vector = {
  if (isEmpty) {
    throw new IllegalStateException("Haven't seen any document yet.")
  }
  val n = df.length
  val inv = new Array[Double](n)
  var j = 0
  while (j < n) {
    /*
     * If the term is not present in the minimum
     * number of documents, set IDF to 0. This
     * will cause multiplication in IDFModel to
     * set TF-IDF to 0.
     *
     * Since arrays are initialized to 0 by default,
     * we just omit changing those entries.
     */
    if (df(j) >= minDocFreq) {
      inv(j) = math.log((m + 1.0) / (df(j) + 1.0))
    }
    j += 1
  }
  Vectors.dense(inv)
}
Excerpted from the IDF.class source.
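Putting the three steps together, a sketch of the full TF-IDF pipeline (column names and data are illustrative). Unlike Tokenizer and HashingTF, IDF is an Estimator, so it must first be fit on the whole corpus:

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("tfidf").getOrCreate()
import spark.implicits._

val docs = Seq(
  (0, "spark computes tf idf"),
  (1, "cosine similarity compares documents")
).toDF("id", "sentence")

val words = new Tokenizer()
  .setInputCol("sentence").setOutputCol("words").transform(docs)
val tf = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(4096).transform(words)

// fit() scans the corpus to compute document frequencies (the df(j) above),
// then transform() scales each TF vector by the IDF weights.
val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf)
idfModel.transform(tf).select("id", "features").show(truncate = false)
```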
Step 4: Compute document similarity with cosine similarity
scalanlp/breeze - use functions from the breeze library to compute the cosine similarity
import breeze.linalg.{norm, SparseVector => BSV}
import org.apache.spark.ml.linalg.SparseVector

// Convert the Spark vectors to breeze vectors (renamed to BSV to avoid the
// name clash with Spark's own SparseVector), then apply the cosine formula:
// cos = (A . B) / (|A| * |B|)
def calTwoDocSim(f1: SparseVector, f2: SparseVector): Double = {
  val breeze1 = new BSV[Double](f1.indices, f1.values, f1.size)
  val breeze2 = new BSV[Double](f2.indices, f2.values, f2.size)
  breeze1.dot(breeze2) / (norm(breeze1) * norm(breeze2))
}
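A small worked example of the cosine formula itself, using breeze directly (the vectors are made up for illustration):

```scala
import breeze.linalg.{norm, SparseVector => BSV}

// Two 4-dimensional sparse vectors; they overlap only at index 0.
val a = new BSV[Double](Array(0, 2), Array(1.0, 2.0), 4)
val b = new BSV[Double](Array(0, 1), Array(2.0, 3.0), 4)

// dot = 1.0 * 2.0 = 2.0; norm(a) = sqrt(5); norm(b) = sqrt(13)
val sim = a.dot(b) / (norm(a) * norm(b))
println(sim) // roughly 0.248: the vectors point in fairly different directions
```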
References:
http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html
http://dblab.xmu.edu.cn/blog/1261-2/