How data skew arises:
When you run a join and most of the data ends up on a single node, that is data skew. In the Spark UI you will see that a few tasks shuffle far more data than the rest; that is the tell-tale sign.
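Besides the UI, you can confirm which keys are skewed by counting records per key on a sample of the input. A minimal sketch, given a SparkContext sc and the products file used below (the path and the 1% sample fraction are just for illustration):

// Count records per key on a 1% sample and print the heaviest keys.
val pairs = sc.textFile("/tmp/data/products").map(line => (line.split("\t")(0), 1L))
val topKeys = pairs
  .sample(withReplacement = false, fraction = 0.01)
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
  .take(10)
topKeys.foreach { case (key, cnt) => println(s"key=$key sampled count=$cnt") }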
Generating the test data
package com.spark.data
import java.io.{File, FileOutputStream, FileWriter}
import org.apache.commons.io.IOUtils
/**
 * Created by zh on 2017/5/3.
 */
object CreateData {
  // Orders table: 5 rows of (order id, customer name).
  def createOrders(): Unit = {
    val file = new File("C:\\Users\\zh\\Desktop\\111\\data\\orders")
    val sb = new StringBuilder("1\t张三\n")
    sb.append("2\t李四\n")
    sb.append("3\t王二麻子\n")
    sb.append("4\tpetter\n")
    sb.append("5\tmoto\n")
    val out = new FileOutputStream(file)
    IOUtils.write(sb.toString(), out, "utf-8")
    out.close()
  }

  // Products table: 100,000,000 rows under key 1, 100 under key 2 and 10 under
  // key 3, so key 1 is heavily skewed.
  def createProducts(): Unit = {
    val file = new File("C:\\Users\\zh\\Desktop\\111\\data\\products")
    val sb = new StringBuilder()

    // Append the buffered rows to the file and reset the buffer.
    def flush(): Unit = {
      if (sb.nonEmpty) {
        val writer = new FileWriter(file, true)
        writer.write(sb.toString())
        writer.close()
        sb.clear()
      }
    }

    // Generate `count` rows for `key`, flushing every 100,000 rows and once
    // more at the end so no buffered rows are lost.
    def appendRows(key: Int, count: Int): Unit = {
      for (i <- 1 to count) {
        sb.append(s"$key\t00$i\n")
        if (i % 100000 == 0) flush()
      }
      flush()
    }

    appendRows(1, 100000000)
    appendRows(2, 100)
    appendRows(3, 10)
  }

  def main(args: Array[String]): Unit = {
    createOrders()
    createProducts()
  }
}
Upload the generated orders and products files to HDFS.
Key 1 accounts for 100 million of the products records, so joining on that key will cause data skew.
Plain join with no mitigation (leads to skew)
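For example (assuming the generated files have been copied to a machine with HDFS access; adjust the local paths as needed):

hdfs dfs -mkdir -p /tmp/data
hdfs dfs -put orders /tmp/data/orders
hdfs dfs -put products /tmp/data/products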
package com.spark.data
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
/**
 * Created by zh on 2017/5/3.
 */
object SparkDataSkew {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    val sc = new SparkContext(conf)

    // Orders: (order id, customer name)
    val ordersRdd = sc.textFile("/tmp/data/orders")
    val ordersPairs = ordersRdd.map(line => {
      val fields = line.split("\t")
      (fields(0), fields(1))
    })

    // Products: (order id, product code); key 1 holds 100 million rows
    val productsRdd = sc.textFile("/tmp/data/products")
    val productsPairs = productsRdd.map(line => {
      val fields = line.split("\t")
      (fields(0), fields(1))
    })

    // Plain shuffle join: every row with the same key goes to one task, so the
    // task handling key 1 receives almost all of the data.
    val joinData: RDD[(String, (String, String))] = ordersPairs.join(productsPairs)
    val s = joinData.count()
    println(s"Number of rows after the join: $s")
  }
}
This is the usual, unmodified way to write the join, but it triggers data skew.
Submit script
spark-submit --class com.spark.data.SparkDataSkew --master yarn --deploy-mode cluster --driver-memory 3G --num-executors 6 --executor-cores 2 --executor-memory 6G ebay.jar
The job fails outright: the single task that handles key 1 cannot cope with the shuffled data.
Broadcast variables
Idea: collect the smaller RDD to the driver, broadcast it, and do a map-side join while scanning the large products RDD, avoiding the shuffle join entirely.
If you are not familiar with broadcast variables, search for an introduction or see
http://spark.apache.org/docs/1.6.3/programming-guide.html#broadcast-variables
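The basic API is just two calls: broadcast the value from the driver, then read it on the executors via .value (example adapted from the programming guide):

val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value  // Array(1, 2, 3), readable in every task without reshipping it each time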
package com.spark.data
import org.apache.spark.{SparkConf, SparkContext}
/**
 * Created by zh on 2017/5/4.
 */
object SparkDataSkewBroadcast {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.set("spark.shuffle.manager", "hash") // hash shuffle manager (Spark 1.x config key)
    val sc = new SparkContext(conf)

    // Collect the small orders table to the driver as a Map and broadcast it.
    val ordersRdd = sc.textFile("/tmp/data/orders")
    val ordersMap = ordersRdd.map(line => {
      val fields = line.split("\t")
      (fields(0), fields(1))
    }).collect().toMap
    val ordersValue = sc.broadcast(ordersMap)

    val productsRdd = sc.textFile("/tmp/data/products")
    val productsPairs = productsRdd.map(line => {
      val fields = line.split("\t")
      (fields(0), fields(1))
    })

    // Map-side join: look up each product row in the broadcast map, so no
    // shuffle happens at all.
    val joinData = productsPairs.map(it => {
      (it._1, it._2, ordersValue.value.getOrElse(it._1, null))
    })
    val s = joinData.count()
    println(s"Number of rows after the join: $s")
  }
}
spark-submit --class com.spark.data.SparkDataSkewBroadcast --master yarn --deploy-mode cluster --driver-memory 3G --num-executors 6 --executor-cores 2 --executor-memory 6G ebay.jar
Finished in 20 seconds.
Random numbers (salting)
Idea: expand the orders RDD 50x, appending a suffix to each key, with suffixes ranging over 0-49.
Then append a random number in [0, 50) to each product key.
Every salted product key is then guaranteed to match one of the expanded order keys, which artificially spreads the skewed key across many tasks and machines.
package com.spark.data
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.util.Random
/**
 * Created by zh on 2017/5/4.
 */
object SparkDataSkewRandom {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.set("spark.shuffle.manager", "hash") // hash shuffle manager (Spark 1.x config key)
    val sc = new SparkContext(conf)

    val ordersRdd = sc.textFile("/tmp/data/orders")
    val ordersPairs = ordersRdd.map(line => {
      val fields = line.split("\t")
      (fields(0), fields(1))
    })

    // Expand the small orders RDD 50x: key -> key_0 ... key_49
    val ordersPairsExpend = ordersPairs.flatMap(it => {
      for (i <- 0 until 50) yield {
        (it._1 + "_" + i, it._2)
      }
    })

    val productsRdd = sc.textFile("/tmp/data/products")
    val productsPairs = productsRdd.map(line => {
      val fields = line.split("\t")
      (fields(0), fields(1))
    })

    // Salt each product key with a random suffix in [0, 50), using the same
    // "_" separator as the expanded orders so the keys actually match.
    val productsPairRandom = productsPairs.map(it => {
      (it._1 + "_" + Random.nextInt(50), it._2)
    })

    // Key 1's 100 million rows are now spread over 50 distinct keys, so the
    // join work is distributed across many tasks.
    val joinData: RDD[(String, (String, String))] = ordersPairsExpend.join(productsPairRandom)
    val s = joinData.count()
    println(s"Number of rows after the join: $s")
  }
}
Submit script
spark-submit --class com.spark.data.SparkDataSkewRandom --master yarn --deploy-mode cluster --driver-memory 3G --num-executors 6 --executor-cores 2 --executor-memory 6G ebay.jar
Finished in 36 seconds.
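One caveat: the joined keys still carry the random suffix. The count above does not care, but if downstream logic needs the original key, strip the salt after the join. A minimal sketch, assuming the "_" separator used above:

// Recover the original key by dropping the "_<salt>" suffix.
val restored = joinData.map { case (saltedKey, (customerName, productCode)) =>
  (saltedKey.split("_")(0), (customerName, productCode))
}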