Preface:
Why implement a custom partitioner?
When Spark processes key-value data, a skewed key distribution can leave the load spread unevenly across partitions. To make better use of resources we repartition the data. Besides Spark's built-in partitioners, HashPartitioner & RangePartitioner, we can also implement our own.
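For context, here is a minimal sketch of how a hash-based partitioner assigns keys. It is simplified from the idea behind HashPartitioner, not its exact source:
import org.apache.spark.Partitioner
// A simplified hash-based partitioner: route each key by its hashCode.
class SimpleHashPartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = {
    // hashCode can be negative, so keep the result in [0, partitions).
    val mod = key.hashCode % partitions
    if (mod < 0) mod + partitions else mod
  }
}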
Enough talk:
test case: run a word count over the following data with a custom partitioner
hadoop,
spark
hive
hive
spark
hbase,
hbase
hbase
hbase,
hbase
kafka
kafka,
coding as below:
So that each partition's contents are visible directly on the command line, I use mapPartitions to add 1 to every value; it also doubles as a bit of operator practice.
package com.brd.engine

import com.brd.util.sparkUtils.partitionTest
import org.apache.spark.{SparkConf, SparkContext}

object SparkTest {
  /**
   * @author brandon
   * @param args command-line arguments (unused)
   */
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SparkTest")
      .setMaster("local[1]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")
    val rdd = sc.textFile("src/data/wd.txt")
    val rdd2 = rdd.flatMap(line => line.split(" "))
      .map(word => word.replaceAll(",", "").trim)
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .repartitionAndSortWithinPartitions(new partitionTest(4))
      // The body of mapPartitions can be wrapped in a function and passed in.
      .mapPartitions(this.mapUse2)
    rdd2.foreach(x => println(x))
    sc.stop()
  }

  // 1. Wrap the per-partition logic as a function value with an explicit type.
  val mapUse: Iterator[(String, Int)] => Iterator[(String, Int)] = x => {
    println("**********************the partition line ***********************")
    x.toList.map(y => (y._1, y._2 + 1)).toIterator
  }

  // 2. The same function, annotating the parameter type instead.
  val mapUse2 = (x: Iterator[(String, Int)]) => {
    println("**********************the partition line ***********************")
    x.toList.map(y => (y._1, y._2 + 1)).toIterator
  }
}
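A note on the design choice: repartitionAndSortWithinPartitions both redistributes the data by the custom partitioner and sorts keys within each partition. If the sort is not needed, partitionBy does only the redistribution. A minimal variant of the same pipeline (only the shuffle operator changes):
val rdd3 = rdd.flatMap(line => line.split(" "))
  .map(word => word.replaceAll(",", "").trim)
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  // partitionBy redistributes by the custom partitioner, without sorting.
  .partitionBy(new partitionTest(4))
  .mapPartitions(mapUse2)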
--------------------Implementing the custom partitioner---------------------------------
import org.apache.spark.Partitioner

/**
 * @description Takes the desired number of partitions as a constructor
 *              argument, extends Partitioner, and overrides two methods:
 *
 *              numPartitions - the total number of partitions
 *              getPartition  - routes each key into a partition by our rule
 *
 * Rule for this case: keys containing "hbase" go to partition 1,
 *                     keys containing "spark" go to partition 2,
 *                     everything else goes to partition 0.
 */
class partitionTest(Partitions: Int) extends Partitioner {
  override def numPartitions: Int = Partitions

  override def getPartition(key: Any): Int = {
    if (key.toString.contains("hbase")) 1
    else if (key.toString.contains("spark")) 2
    else 0
  }
}
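As a quick sanity check (a standalone sketch on the driver, not part of the original job), the routing rule can be exercised directly:
val p = new partitionTest(4)
println(p.getPartition("hbase")) // 1
println(p.getPartition("spark")) // 2
println(p.getPartition("hive"))  // 0, the catch-all partition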
The code asks for four partitions. Let's look at the printed output:
With four partitions, hbase and spark each sit in their own partition; the bare marker with no data after it comes from the fourth partition, which receives no keys.
**********************the partition line ***********************
**********************the partition line ***********************
(hadoop,2)
(hive,3)
(kafka,3)
**********************the partition line ***********************
(hbase,6)
**********************the partition line ***********************
(spark,3)
With three partitions, hbase and spark likewise each get their own partition:
**********************the partition line ***********************
(hadoop,2)
(hive,3)
(kafka,3)
**********************the partition line ***********************
(hbase,6)
**********************the partition line ***********************
(spark,3)
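If you would rather not rely on the println markers, glom() (a standard RDD method) collects each partition as an array, so the layout can be printed in one pass. A minimal sketch:
// Turn each partition into an array and print it with its index.
rdd2.glom().collect().zipWithIndex.foreach { case (part, i) =>
  println(s"partition $i: ${part.mkString(", ")}")
}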
So the custom partitioning works as intended.
End Respect.