There are already many excellent articles online introducing TF-IDF and cosine similarity, so I won't repeat them here. This is a brief note on how to combine those ideas with Spark MLlib to compute document similarity.
Step 1: Extract terms from the documents
Spark MLlib - Tokenizer.class // by default, splits on whitespace to extract terms
/**
 * A tokenizer that converts the input string to lowercase and then splits it by white spaces.
 *
 * @see [[RegexTokenizer]]
 */
@Since("1.2.0")
class Tokenizer @Since("1.4.0") (@Since("1.4.0") override val uid: String)
  extends UnaryTransformer[String, Seq[String], Tokenizer] with DefaultParamsWritable {

  @Since("1.2.0")
  def this() = this(Identifiable.randomUID("tok"))

  override protected def createTransformFunc: String => Seq[String] = {
    _.toLowerCase.split("\\s")
  }

  override protected def validateInputType(inputType: DataType): Unit = {
    require(inputType == StringType, s"Input type must be string type but got $inputType.")
  }

  override protected def outputDataType: DataType = new ArrayType(StringType, true)

  @Since("1.4.1")
  override def copy(extra: ParamMap): Tokenizer = defaultCopy(extra)
}
Excerpted from the Tokenizer.class source.
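Tokenizer can be used roughly as follows. This is a minimal sketch assuming a local SparkSession; the column names and sample sentences are made up for illustration:

```scala
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession

// Local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("tokenize").getOrCreate()
import spark.implicits._

val docs = Seq(
  (0, "Spark MLlib supports TF-IDF"),
  (1, "Cosine similarity compares documents")
).toDF("id", "sentence")

// Lowercases each sentence and splits on whitespace, as in the source above.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
tokenizer.transform(docs).select("words").show(truncate = false)
```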
Step 2: Compute term frequencies
Spark MLlib - HashingTF.class // hashes each term and counts its occurrences
/**
 * Transforms the input document into a sparse term frequency vector.
 */
@Since("1.1.0")
def transform(document: Iterable[_]): Vector = {
  val termFrequencies = mutable.HashMap.empty[Int, Double]
  val setTF = if (binary) (i: Int) => 1.0 else (i: Int) => termFrequencies.getOrElse(i, 0.0) + 1.0
  val hashFunc: Any => Int = getHashFunction
  document.foreach { term =>
    val i = Utils.nonNegativeMod(hashFunc(term), numFeatures)
    termFrequencies.put(i, setTF(i))
  }
  Vectors.sparse(numFeatures, termFrequencies.toSeq)
}
Excerpted from the HashingTF.class source.
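A sketch of HashingTF on a pre-tokenized column (again assuming a local SparkSession; the data and numFeatures value are illustrative):

```scala
import org.apache.spark.ml.feature.HashingTF
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("tf").getOrCreate()
import spark.implicits._

// A pre-tokenized "words" column; note the repeated term "spark".
val words = Seq((0, Seq("spark", "mllib", "spark"))).toDF("id", "words")

// Each term is hashed into one of numFeatures buckets, so hash collisions are
// possible; a larger numFeatures reduces the chance of two terms sharing a count.
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(4096)
hashingTF.transform(words).select("rawFeatures").show(truncate = false)
```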
Step 3: Compute the inverse document frequency (IDF) to get TF-IDF
Spark MLlib - IDF.class
/** Returns the current IDF vector. */
def idf(): Vector = {
  if (isEmpty) {
    throw new IllegalStateException("Haven't seen any document yet.")
  }
  val n = df.length
  val inv = new Array[Double](n)
  var j = 0
  while (j < n) {
    /*
     * If the term is not present in the minimum
     * number of documents, set IDF to 0. This
     * will cause multiplication in IDFModel to
     * set TF-IDF to 0.
     *
     * Since arrays are initialized to 0 by default,
     * we just omit changing those entries.
     */
    if (df(j) >= minDocFreq) {
      inv(j) = math.log((m + 1.0) / (df(j) + 1.0))
    }
    j += 1
  }
  Vectors.dense(inv)
}
Excerpted from the IDF.class source.
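Putting the three steps together, a sketch of the full TF-IDF pipeline (column names and data are illustrative). Unlike Tokenizer and HashingTF, IDF is an Estimator, so it must first be fit on the whole corpus:

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("tfidf").getOrCreate()
import spark.implicits._

val docs = Seq(
  (0, "spark computes tf idf"),
  (1, "cosine similarity compares documents")
).toDF("id", "sentence")

val words = new Tokenizer()
  .setInputCol("sentence").setOutputCol("words").transform(docs)
val tf = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(4096).transform(words)

// fit() scans the corpus to compute document frequencies (the df(j) above),
// then transform() scales each TF vector by the IDF weights.
val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf)
idfModel.transform(tf).select("id", "features").show(truncate = false)
```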
Step 4: Compute document similarity with cosine similarity
scalanlp/breeze - use functions from the breeze library to compute the cosine similarity
import breeze.linalg.{norm, SparseVector => BSV}
import org.apache.spark.ml.linalg.SparseVector

// Convert the Spark vectors to breeze vectors (renamed to BSV to avoid the
// name clash with Spark's own SparseVector), then apply the cosine formula:
// cos = (A . B) / (|A| * |B|)
def calTwoDocSim(f1: SparseVector, f2: SparseVector): Double = {
  val breeze1 = new BSV[Double](f1.indices, f1.values, f1.size)
  val breeze2 = new BSV[Double](f2.indices, f2.values, f2.size)
  breeze1.dot(breeze2) / (norm(breeze1) * norm(breeze2))
}
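A small worked example of the cosine formula itself, using breeze directly (the vectors are made up for illustration):

```scala
import breeze.linalg.{norm, SparseVector => BSV}

// Two 4-dimensional sparse vectors; they overlap only at index 0.
val a = new BSV[Double](Array(0, 2), Array(1.0, 2.0), 4)
val b = new BSV[Double](Array(0, 1), Array(2.0, 3.0), 4)

// dot = 1.0 * 2.0 = 2.0; norm(a) = sqrt(5); norm(b) = sqrt(13)
val sim = a.dot(b) / (norm(a) * norm(b))
println(sim) // roughly 0.248: the vectors point in fairly different directions
```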
References:
http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html
http://dblab.xmu.edu.cn/blog/1261-2/