Spark development lessons learned on the job.
CASE 1
import org.apache.spark.mllib.linalg.{SparseVector, Vectors}

// Cosine similarity between two high-dimensional sparse vectors:
// only indices present in both vectors contribute to the dot product.
def cosinSimilarity(v1: SparseVector, v2: SparseVector): Double = {
  val indices2: Array[Int] = v2.indices
  var sum = 0.0
  for (index <- v1.indices) {
    if (indices2.contains(index)) {
      sum += v1(index) * v2(index)
    }
  }
  sum / (Vectors.norm(v1, 2) * Vectors.norm(v2, 2))
}
This computes the cosine similarity between two vectors that are very high-dimensional and extremely sparse. Optimization: use the SparseVector type built into Spark MLlib, so that only positions where both vectors are non-zero contribute to the dot product, greatly reducing the amount of computation.
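The `contains` lookup above is linear, making the loop O(n1·n2) over the non-zero entries. Since `SparseVector` stores its indices sorted ascending, the same dot product can be computed with a two-pointer merge in O(n1 + n2). A minimal, dependency-free sketch (the parallel index/value arrays mimic `SparseVector`'s internal layout, and `sparseDot` is an illustrative name, not part of the original code):

```scala
// Two-pointer sparse dot product over index arrays sorted ascending.
// idx1/vals1 and idx2/vals2 are aligned by position, as in SparseVector.
def sparseDot(idx1: Array[Int], vals1: Array[Double],
              idx2: Array[Int], vals2: Array[Double]): Double = {
  var i = 0
  var j = 0
  var sum = 0.0
  while (i < idx1.length && j < idx2.length) {
    if (idx1(i) == idx2(j)) {       // index present in both: contributes
      sum += vals1(i) * vals2(j)
      i += 1; j += 1
    } else if (idx1(i) < idx2(j)) { // advance whichever pointer lags
      i += 1
    } else {
      j += 1
    }
  }
  sum
}
```

Each pointer only moves forward, so every non-zero entry is visited at most once.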
CASE 2
import com.huaban.analysis.jieba.JiebaSegmenter.SegMode
import com.huaban.analysis.jieba.{JiebaSegmenter, SegToken}
import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
import scala.collection.mutable.ListBuffer

def getSrcDF(sc: SparkContext, hiveContext: HiveContext): DataFrame = {
  import hiveContext.implicits._
  // getStopWord (defined elsewhere) loads the stop-word set
  val stopWord_bc = sc.broadcast(getStopWord(sc))
  val sql =
    s"""
       |select id, desc
       |from xx_database.xx_table
       |""".stripMargin
  // 1. filter out words of length 1
  // 2. filter out stop words
  val srcDF: DataFrame = hiveContext.sql(sql).mapPartitions(partition => {
    val buffer = new ListBuffer[(String, String)]
    val jiebaSegmenter = new JiebaSegmenter() // note: constructed once per partition
    val it = partition
    while (it.hasNext) {
      val row = it.next()
      val id = row.getLong(0).toString
      val text = jiebaSegmenter
        .process(row.getString(1), SegMode.SEARCH)
        .toArray
        .map(_.asInstanceOf[SegToken].word)
        .filter(_.length > 1)
        .filter(x => !stopWord_bc.value.contains(x))
        .mkString(" ")
      buffer.append((id, text))
    }
    buffer.iterator
  }).toDF("id", "text")
  srcDF
}
In the jieba segmentation above, the optimizations are: use mapPartitions instead of map, and place val jiebaSegmenter = new JiebaSegmenter() outside the while loop, so the segmenter is constructed once per partition rather than once per row. This avoids the overhead of repeated new operations.
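The same pattern applies to any resource that is expensive to construct (segmenters, database connections, parsers). A plain-Scala sketch of the difference, with a construction counter standing in for the real cost (Resource, perElement, and perPartition are illustrative names, not part of the original code):

```scala
// Counts how many times the expensive resource is constructed.
var constructions = 0

// Stand-in for JiebaSegmenter: costly to build, cheap to call.
class Resource {
  constructions += 1
  def process(s: String): String = s.toUpperCase
}

// Anti-pattern: one Resource per element (what map-per-row amounts to).
def perElement(data: Iterator[String]): Iterator[String] =
  data.map(s => new Resource().process(s))

// Optimized pattern: one Resource per partition, reused for every element.
def perPartition(data: Iterator[String]): Iterator[String] = {
  val r = new Resource()
  data.map(r.process)
}
```

With N rows per partition, the first version pays the construction cost N times, the second exactly once.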
CASE 3
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import com.xx.xx.cosinSimilarity

// collect the smaller dataset to the driver and broadcast it to every executor
val childData: Array[(String, SparseVector)] = child.collect()
val childData_bc = sc.broadcast(childData)
val result = parentData.flatMap(x => {
  childData_bc.value.map { y =>
    val cosin = cosinSimilarity(x._2, y._2)
    val parentItem = x._1
    val childItem = y._1
    (parentItem, childItem, cosin)
  }.filter(x => x._1 != x._2)
}).groupBy(x => x._1).map(x => {
  val detapp = x._1
  val appList = x._2.mkString(",")
  (detapp, appList)
})
To compute pairwise similarities between vectors, do not use the RDD cartesian operation: its computational and shuffle cost is very high. The optimization above instead broadcasts the smaller dataset and uses two nested maps, flatMap over the large RDD with an inner map over the broadcast array.
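The broadcast + flatMap pattern can be shown without Spark: the small side is held locally (as a broadcast variable would be on each executor) while the large side is streamed through. A dependency-free sketch using dense vectors for brevity (cosine, small, and large are illustrative names; the original uses sparse vectors):

```scala
// Dense cosine similarity, for illustration only.
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val na = math.sqrt(a.map(x => x * x).sum)
  val nb = math.sqrt(b.map(x => x * x).sum)
  dot / (na * nb)
}

// Small side: held in memory, like a broadcast array on each executor.
val small = Array(("c1", Array(1.0, 0.0)), ("c2", Array(0.0, 1.0)))

// Large side: streamed element by element, like partitions of a big RDD.
val large = Iterator(("p1", Array(1.0, 0.0)))

// flatMap over the large side, inner map over the broadcast side:
// each large element yields one similarity triple per small element.
val pairs = large.flatMap { case (pid, pv) =>
  small.map { case (cid, cv) => (pid, cid, cosine(pv, cv)) }
}.toList
```

This produces the same (parent, child, similarity) triples as a cartesian product would, but without shuffling the large dataset: each task only reads its local copy of the broadcast array.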