Spark in Practice: Join Optimization


Join optimization is a near-certain topic in interviews for Spark-related positions. Common joins fall into two categories: the map-side join and the reduce-side join. When joining a large table with a small one, a map-side join can improve efficiency significantly.
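At its core, a map-side join is just a hash join: build an in-memory map from the small side, then stream the large side through it, so no shuffle is needed. A minimal sketch with plain Scala collections (no Spark; the names here are illustrative, not from the code below):

```scala
// Map-side (hash) join sketch: the small side becomes an in-memory map
// (the "broadcast" side), the large side is streamed through it.
object MapSideJoinSketch {
  def join[K, V, W](small: Seq[(K, V)], large: Seq[(K, W)]): Seq[(K, (W, V))] = {
    val lookup = small.toMap  // must fit in memory, like collectAsMap() below
    // Inner join: large-side records whose key is missing from the small
    // side are simply dropped.
    large.collect { case (k, w) if lookup.contains(k) => (k, (w, lookup(k))) }
  }
}
```

Note this is an inner join, matching the `if (m.contains(k))` filter in the Spark code below; unmatched keys are dropped rather than padded with empty values.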


/**
 * Created by shenjiyi on 2015/7/8.
 */

package com.test

import com.test.utils.MySparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object TestJoin {
  def main (args: Array[String]): Unit ={
    val conf = new SparkConf()
      .setMaster(args(0))
      .setAppName("TestJoin")
      .set("spark.speculation", "true")
      .set("spark.default.parallelism", "200")
    val sc = new MySparkContext(conf)

    val input1 = sc.rawTextFile(args(1), "GB18030")
    val input2 = sc.rawTextFile(args(2), "GB18030")
    val output1 = args(3)
    val output2 = args(4)

    // Build the small table as key/value pairs and pull it to the driver.
    // collectAsMap() only works if the small table fits in driver memory.
    val pairs = input1.map { x =>
      val pos = x.indexOf('\t')
      (x.substring(0, pos), x.substring(pos + 1))
    }.collectAsMap()


    // Map-side join: suited to joining a small table against a large one.
    // Load the small table into memory, broadcast it to every node, and then
    // join the large table against it locally -- this avoids a shuffle and
    // improves efficiency.
    val broadCastMap = sc.broadcast(pairs)
    input2.map { x =>
      val pos = x.indexOf('\t')
      (x.substring(0, pos), x.substring(pos + 1))
    }.mapPartitions { iter =>
      // Read the broadcast value once per partition, then stream-join.
      val m = broadCastMap.value
      for {
        (k, v) <- iter
        if m.contains(k)
      } yield (k, (v, m(k)))
    }.saveAsTextFile(output1)


    // Reduce-side join: both sides are shuffled by key, so there is no size
    // limit, but it is slower than the broadcast variant above.
    val pairs2 = input1.map { x =>
      val pos = x.indexOf('\t')
      (x.substring(0, pos), x.substring(pos + 1))
    }
    input2.map { x =>
      val pos = x.indexOf('\t')
      (x.substring(0, pos), x.substring(pos + 1))
    }.join(pairs2).saveAsTextFile(output2)

    sc.stop()
  }
}
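For contrast, the reduce-side join shuffles both sides by key before matching, which is what `RDD.join` does under the hood (via `cogroup`): records sharing a key land in the same group and are combined pairwise. The same idea sketched in plain Scala collections (illustrative only, not Spark's actual implementation):

```scala
// Reduce-side join sketch: both sides are grouped ("shuffled") by key,
// then matching groups are combined -- a per-key cartesian product.
object ReduceSideJoinSketch {
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] = {
    val l = left.groupBy(_._1)   // stand-in for the shuffle of the left side
    val r = right.groupBy(_._1)  // stand-in for the shuffle of the right side
    for {
      k <- (l.keySet intersect r.keySet).toSeq // inner join: shared keys only
      v <- l(k).map(_._2)
      w <- r(k).map(_._2)
    } yield (k, (v, w))
  }
}
```

Unlike the broadcast version, neither side has to fit in memory on one machine, which is why this is the fallback when both tables are large.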

