How to handle data skew in Spark Core

What data skew is:


During a join you may find that almost all of the data lands on the same node; that is data skew. In the Spark UI you can see that a few tasks shuffle far more data (and run far longer) than the others, which is how you can tell the job is skewed.
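Besides the UI, you can confirm the skew programmatically. The following is a minimal sketch (not part of the original jobs) that samples the pair RDD and counts records per key; it assumes an existing SparkContext sc (for example in spark-shell) and the /tmp/data/products path used by the jobs below.

// Sketch: estimate the per-key distribution on a small sample to find skewed keys.
val productsPairs = sc.textFile("/tmp/data/products").map(line => {
  val fields = line.split("\t")
  (fields(0), fields(1))
})
// Sample about 1% of the records and count them by key on the driver.
val keyCounts = productsPairs.sample(withReplacement = false, fraction = 0.01).countByKey()
keyCounts.toSeq.sortBy(-_._2).take(10).foreach { case (key, cnt) =>
  println(s"key=$key sampled count=$cnt")
}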

 

Generating the test data

package com.spark.data

import java.io.{File, FileOutputStream, FileWriter}

import org.apache.commons.io.IOUtils

/**
  * Created by zh on 2017/5/3.
  * Generates the test data: a small orders file and a heavily skewed products file.
  */
object CreateData {

  def createOrders(): Unit = {
    val file = new File("C:\\Users\\zh\\Desktop\\111\\data\\orders")
    val sb = new StringBuilder("1\t张三\n")
    sb.append("2\t李四\n")
    sb.append("3\t王二麻子\n")
    sb.append("4\tpetter\n")
    sb.append("5\tmoto\n")
    IOUtils.write(sb.toString(), new FileOutputStream(file), "utf-8")
  }

  def createProducts(): Unit = {
    val file = new File("C:\\Users\\zh\\Desktop\\111\\data\\products")
    val sb = new StringBuilder()

    // Append the buffered lines to the file and clear the buffer.
    def flush(): Unit = {
      if (sb.nonEmpty) {
        val writer = new FileWriter(file, true)
        writer.write(sb.toString())
        writer.close()
        sb.clear()
      }
    }

    // Key "1" gets 100 million records: this is the skewed key.
    // Flush every 100,000 lines so the StringBuilder does not exhaust the heap.
    for (i <- 1 to 100000000) {
      sb.append(s"1\t00$i\n")
      if (i % 100000 == 0) flush()
    }
    flush()

    // Key "2" gets 100 records.
    for (i <- 1 to 100) {
      sb.append(s"2\t00$i\n")
    }
    flush()

    // Key "3" gets 10 records.
    for (i <- 1 to 10) {
      sb.append(s"3\t00$i\n")
    }
    flush()
  }

  def main(args: Array[String]): Unit = {
    createProducts()
  }
}

Put the generated orders and products files on HDFS (the jobs below read them from /tmp/data/orders and /tmp/data/products).

In the products data, key 1 has 100 million records, so joining on that key sends almost all of the work to a single task and causes data skew.

Plain join with no special handling (causes skew)

package com.spark.data

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by zh on 2017/5/3.
  * Plain shuffle join; the skewed key "1" ends up in a single task.
  */
object SparkDataSkew {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)

    // Small side: (orderId, userName)
    val ordersRdd = sc.textFile("/tmp/data/orders")
    val ordersPairs = ordersRdd.map(line => {
      val fields = line.split("\t")
      (fields(0), fields(1))
    })

    // Large, skewed side: (orderId, productId), 100 million rows for key "1"
    val productsRdd = sc.textFile("/tmp/data/products")
    val productsPairs = productsRdd.map(line => {
      val fields = line.split("\t")
      (fields(0), fields(1))
    })

    val joinData: RDD[(String, (String, String))] = ordersPairs.join(productsPairs)
    val s = joinData.count()
    println(s"Number of records after the join: $s")
  }
}

There is nothing special about this code (it is how the join is usually written), but it runs into data skew.

 

Submit script

spark-submit --class com.spark.data.SparkDataSkew --master yarn --deploy-mode cluster --driver-memory 3G --num-executors 6 --executor-cores 2 --executor-memory 6G  ebay.jar


 

The job simply errors out.

 

Broadcast variables

Idea: collect the small RDD to the driver, broadcast it, then do a map-side join on the large products RDD. This avoids the shuffle join entirely.

If you are not familiar with broadcast variables, look them up or see:

http://spark.apache.org/docs/1.6.3/programming-guide.html#broadcast-variables
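For reference, the broadcast API comes down to two calls: create the broadcast on the driver with sc.broadcast and read it on the executors through .value. A minimal sketch, assuming an existing SparkContext sc (e.g. spark-shell); the sample data here is made up:

// Ship a read-only lookup map to every executor once, instead of once per task.
val lookup = sc.broadcast(Map("1" -> "张三", "2" -> "李四"))
val withNames = sc.parallelize(Seq(("1", "001"), ("2", "002"), ("3", "003")))
  .map { case (id, product) => (id, product, lookup.value.getOrElse(id, "unknown")) }
withNames.collect().foreach(println)
// (1,001,张三), (2,002,李四), (3,003,unknown)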

 

package com.spark.data

import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by zh on 2017/5/4.
  * Broadcast (map-side) join: the small orders side is collected to the driver
  * and broadcast to every executor, so no shuffle is needed at all.
  */
object SparkDataSkewBroadcast {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    // The correct config key is "spark.shuffle.manager"; it makes no difference
    // here because the broadcast join does not shuffle at all.
    conf.set("spark.shuffle.manager", "hash")
    val sc = new SparkContext(conf)

    // The orders side is tiny, so collect it to the driver as a lookup map.
    val ordersRdd = sc.textFile("/tmp/data/orders")
    val ordersMap = ordersRdd.map(line => {
      val fields = line.split("\t")
      (fields(0), fields(1))
    }).collect().toMap
    val ordersValue = sc.broadcast(ordersMap)

    val productsRdd = sc.textFile("/tmp/data/products")
    val productsPairs = productsRdd.map(line => {
      val fields = line.split("\t")
      (fields(0), fields(1))
    })

    // Map-side join: look each product key up in the broadcast map.
    val joinData = productsPairs.map(it => {
      val orders = ordersValue.value
      (it._1, it._2, orders.getOrElse(it._1, null))
    })

    val s = joinData.count()
    println(s"Number of records after the join: $s")
  }
}

 

spark-submit --class com.spark.data.SparkDataSkewBroadcast --master yarn --deploy-mode cluster --driver-memory 3G --num-executors 6 --executor-cores 2 --executor-memory 6G  ebay.jar

 

 

Finishes in 20 seconds.

Random-number salting

Idea: expand the keys of the orders RDD by a factor of 50, appending a suffix in the range 0-49 to every key. Then append a random number below 50 to every key of the products RDD. This guarantees that the expanded orders and the salted products can still be joined, while the skewed RDD is deliberately spread across multiple machines.
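To make the matching explicit: every salted products key of the form id + "_" + r with r in [0, 50) has exactly one counterpart among the expanded orders keys, so the join loses nothing while the hot key is spread over 50 distinct join keys. A minimal sketch of just the two key transforms (my own illustration; the salt factor 50 matches the code below, and both sides must use the same separator):

import scala.util.Random

val salt = 50
// Orders side: replicate each key, "1" -> "1_0" ... "1_49".
def expandOrderKey(id: String, name: String): Seq[(String, String)] =
  (0 until salt).map(i => (s"${id}_$i", name))
// Products side: append one random suffix in [0, 50), "1" -> e.g. "1_37".
def saltProductKey(id: String, product: String): (String, String) =
  (s"${id}_${Random.nextInt(salt)}", product)

println(expandOrderKey("1", "张三").take(3))  // (1_0,张三), (1_1,张三), (1_2,张三)
println(saltProductKey("1", "001"))           // e.g. (1_12,001)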

package com.spark.data

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

import scala.util.Random

/**
  * Created by zh on 2017/5/4.
  * Salted join: expand the small orders side 50x and add a random 0-49 suffix
  * to every products key, so the skewed key is spread over 50 join keys.
  */
object SparkDataSkewRandom {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    // The correct config key is "spark.shuffle.manager" (the original used the
    // HashShuffleManager class name by mistake).
    conf.set("spark.shuffle.manager", "hash")
    val sc = new SparkContext(conf)

    val ordersRdd = sc.textFile("/tmp/data/orders")
    val ordersPairs = ordersRdd.map(line => {
      val fields = line.split("\t")
      (fields(0), fields(1))
    })

    // Expand the orders RDD 50x: key becomes "key_0" ... "key_49".
    val ordersPairsExpend = ordersPairs.flatMap(it => {
      for (i <- 0 until 50) yield {
        (it._1 + "_" + i, it._2)
      }
    })

    val productsRdd = sc.textFile("/tmp/data/products")
    val productsPairs = productsRdd.map(line => {
      val fields = line.split("\t")
      (fields(0), fields(1))
    })

    // Salt the products keys with a random suffix in [0, 50).
    // The separator must be the same "_" used on the orders side,
    // otherwise the join produces no matches.
    val productsPairRandom = productsPairs.map(it => {
      (it._1 + "_" + Random.nextInt(50), it._2)
    })

    val joinData: RDD[(String, (String, String))] = ordersPairsExpend.join(productsPairRandom)
    val s = joinData.count()
    println(s"Number of records after the join: $s")
  }
}

 

Submit script

spark-submit --class com.spark.data.SparkDataSkewRandom --master yarn --deploy-mode cluster --driver-memory 3G --num-executors 6 --executor-cores 2 --executor-memory 6G  ebay.jar

 

 

 

Finishes in 36 seconds.
