[Spark] Code Optimization Tips

Some Spark development lessons learned on the job.

CASE 1

import org.apache.spark.mllib.linalg.{SparseVector, Vector, Vectors}

// Cosine similarity over two sparse vectors: only indices that are
// non-zero in both vectors contribute to the dot product.
def cosinSimilarity(v1: SparseVector, v2: SparseVector): Double = {
  val indices2: Array[Int] = v2.indices
  var sum = 0.0
  for (index <- v1.indices) {       // iterate only over v1's non-zero indices
    if (indices2.contains(index)) { // keep only indices also non-zero in v2
      sum += v1(index) * v2(index)
    }
  }
  sum / (Vectors.norm(v1, 2) * Vectors.norm(v2, 2))
}

This computes the cosine similarity between two vectors that are high-dimensional and very sparse. The optimization: use the SparseVector type that ships with Spark MLlib, so the dot product only touches positions where both vectors are non-zero, greatly reducing the amount of computation.
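A further cut is possible because MLlib requires a SparseVector's indices to be in ascending order: the linear Array.contains probe can be replaced with a two-pointer merge over the two index arrays, bringing the dot product down to O(nnz1 + nnz2). A minimal sketch (the name dotTwoPointer is mine, not from the original):

import org.apache.spark.mllib.linalg.SparseVector

// Sketch: dot product via a two-pointer merge over the sorted index arrays.
// Avoids the O(nnz1 * nnz2) cost of calling Array.contains per index.
def dotTwoPointer(v1: SparseVector, v2: SparseVector): Double = {
  val (i1, x1) = (v1.indices, v1.values)
  val (i2, x2) = (v2.indices, v2.values)
  var p1 = 0
  var p2 = 0
  var sum = 0.0
  while (p1 < i1.length && p2 < i2.length) {
    if (i1(p1) == i2(p2)) {   // index non-zero in both vectors
      sum += x1(p1) * x2(p2)
      p1 += 1
      p2 += 1
    } else if (i1(p1) < i2(p2)) {
      p1 += 1                 // advance whichever pointer lags behind
    } else {
      p2 += 1
    }
  }
  sum
}

Dropping this in for the contains loop inside cosinSimilarity leaves the norm computation unchanged and only speeds up the dot product.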

CASE 2

import com.huaban.analysis.jieba.JiebaSegmenter.SegMode
import com.huaban.analysis.jieba.{JiebaSegmenter, SegToken}
import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
import scala.collection.mutable.ListBuffer

def getSrcDF(sc: SparkContext, hiveContext: HiveContext): DataFrame = {
    import hiveContext.implicits._
    val stopWord_bc = sc.broadcast(getStopWord(sc)) // helper that loads the stop-word set (not shown here)
    val sql =
        s"""
           |select id, desc
           |from xx_database.xx_table
           |""".stripMargin

    // 1. filter out words of length 1
    // 2. filter out stop words
    val srcDF: DataFrame = hiveContext.sql(sql).mapPartitions(partition => {
        val buffer = new ListBuffer[(String, String)]
        val jiebaSegmenter = new JiebaSegmenter() // note: created once per partition, not once per row

        while (partition.hasNext) {
            val row = partition.next()
            val id = row.getLong(0).toString
            val text = jiebaSegmenter
                .process(row.getString(1), SegMode.SEARCH)
                .toArray
                .map(_.asInstanceOf[SegToken].word)
                .filter(_.length > 1)
                .filter(x => !stopWord_bc.value.contains(x))
                .mkString(" ")
            buffer.append((id, text))
        }
        buffer.iterator
    }).toDF("id", "text")

    srcDF
}

The optimizations in the jieba segmentation above: use mapPartitions instead of map, and put val jiebaSegmenter = new JiebaSegmenter() outside the while loop, so the segmenter is constructed once per partition rather than once per row, avoiding the overhead of repeated new calls. A minimal before/after sketch follows.
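The same pattern applies to any heavyweight per-record object (segmenters, parsers, connections). A minimal sketch of the anti-pattern next to the fix, assuming an RDD[String] named lines:

import com.huaban.analysis.jieba.JiebaSegmenter
import com.huaban.analysis.jieba.JiebaSegmenter.SegMode

// Anti-pattern: map constructs one JiebaSegmenter per record.
val slow = lines.map { line =>
  val seg = new JiebaSegmenter()   // built N times for N records
  seg.process(line, SegMode.SEARCH).toString
}

// Fix: mapPartitions constructs one JiebaSegmenter per partition.
val fast = lines.mapPartitions { it =>
  val seg = new JiebaSegmenter()   // built once per partition
  it.map(line => seg.process(line, SegMode.SEARCH).toString)
}

As a side benefit, returning it.map(...) keeps the partition iterator lazy instead of materializing every row into a buffer first.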

CASE 3

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import com.xx.xx.cosinSimilarity

val childData: Array[(String, SparseVector)] = child.collect()
val childData_bc = sc.broadcast(childData) // broadcast the smaller side to every executor

val result = parentData.flatMap(x => {
    childData_bc.value.map { y =>
        val cosin = cosinSimilarity(x._2, y._2)
        val parentItem = x._1
        val childItem = y._1
        (parentItem, childItem, cosin)
    }.filter(t => t._1 != t._2)    // drop self-pairs
    .groupBy(_._1)                 // all triples here share the same parent item
    .map(g => {
        val detapp = g._1
        val appList = g._2.mkString(",")
        (detapp, appList)
    })
})

To compute pairwise similarity between two sets of vectors, do not use an RDD cartesian product; its computational and shuffle cost is extremely high. Instead, as above, collect and broadcast the smaller side and score each pair locally with a map nested inside flatMap.
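For contrast, a rough sketch of the cartesian approach being avoided, assuming parentData and child are both RDD[(String, SparseVector)] as above:

// Anti-pattern: a full RDD cartesian product shuffles every (parent, child)
// pair across the cluster before any similarity is computed.
val slowPairs = parentData.cartesian(child)
  .filter { case ((p, _), (c, _)) => p != c }
  .map { case ((p, pv), (c, cv)) => (p, c, cosinSimilarity(pv, cv)) }

The broadcast variant ships the small child table once per executor instead, but is only viable while childData fits comfortably in executor memory.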
