1. Using repartitionAndSortWithinPartitions in Spark to repartition and sort within each partition
import org.apache.spark.rdd.RDD

def spark_rand(): Unit = {
  val data = Array(1, 2, 3, 4, 5, 6, 6, 7, 3, 4, 8, 9, 10, 11, 22, 33, 43, 26, 50, 81, 54, 76, 94, 100)
  val s: RDD[(Int, Int)] = spark.sparkContext.parallelize(data, 1).map(str => (str, 1))
  val mx = s.sortBy(_._1, false).first()._1          // maximum key, used to size the partition ranges
  s.repartitionAndSortWithinPartitions(new SortPartitoner(4, mx))
    .mapPartitionsWithIndex { (partionId, iter) =>
      val part_name = "part_" + partionId
      val part_map = scala.collection.mutable.Map[String, List[Int]]()
      part_map(part_name) = List[Int]()
      while (iter.hasNext) {
        part_map(part_name) :+= iter.next()._1       // :+= appends the element to the tail of the list
      }
      part_map.iterator
    }
    .collect().foreach(println(_))
}
// Custom partitioner: keys are bucketed into ranges of equal width, so partition i
// holds keys in [i * partitionerSize, (i + 1) * partitionerSize)
import org.apache.spark.Partitioner

class SortPartitoner(num: Int, max: Int) extends Partitioner {
  override def numPartitions: Int = num
  val partitionerSize = max / num + 1                // width of each key range
  override def getPartition(key: Any): Int = {
    val intKey = key.asInstanceOf[Int]
    intKey / partitionerSize
  }
}
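With the data above, mx is 100 and num is 4, so partitionerSize is 100 / 4 + 1 = 26 and the key ranges are 0-25, 26-51, 52-77 and 78-103, which matches the per-partition output below. A minimal stand-alone sanity check of the bucketing (plain Scala, no SparkContext needed; assumes the SortPartitoner class above is on the classpath):

// Quick check of the range bucketing used by SortPartitoner.
object PartitionerCheck extends App {
  val p = new SortPartitoner(4, 100)        // max = 100, num = 4 => partitionerSize = 26
  // Expected: 22 -> partition 0, 26 -> 1, 54 -> 2, 100 -> 3
  Seq(22, 26, 54, 100).foreach { k =>
    println(s"key $k -> partition ${p.getPartition(k)}")
  }
}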
## Output
(part_0,List(1, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9, 10, 11, 22))
(part_1,List(26, 33, 43, 50))
(part_2,List(54, 76))
(part_3,List(81, 94, 100))
Note:
Calling sortBy after the custom partitioning shuffles the data again; it is equivalent to running another range-partitioning pass (Spark's RangePartitioner). A way to confirm the extra shuffle is sketched after the output below.
Output after calling sortBy:
(part_0,List(1, 2, 3, 3, 4, 4))
(part_1,List(5, 6, 6, 7, 8, 9))
(part_2,List(10, 11, 22, 26, 33, 43))
(part_3,List(50, 54, 76, 81, 94, 100))
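One way to see the extra shuffle is to print the RDD lineage: each ShuffledRDD in the debug string corresponds to one shuffle stage. A minimal sketch, reusing s, mx and SortPartitoner from spark_rand above:

// The repartitionAndSortWithinPartitions result contains one ShuffledRDD;
// adding sortBy on top of it introduces a second one.
val partitioned = s.repartitionAndSortWithinPartitions(new SortPartitoner(4, mx))
println(partitioned.toDebugString)
println(partitioned.sortBy(_._1).toDebugString)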
2. Custom partitioning: sort within each partition, then merge the partitions to get a global sort
002 --- Printing the data grouped by partition
val data = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 22, 33, 43, 26, 50, 81, 54, 76, 94, 100)
spark.sparkContext.parallelize(data, 1)
  .mapPartitionsWithIndex { (partionId, iter) =>
    val part_name = "part_" + partionId
    val part_map = scala.collection.mutable.Map[String, List[Int]]()
    part_map(part_name) = List[Int]()
    while (iter.hasNext) {
      part_map(part_name) :+= iter.next()            // :+= appends the element to the tail of the list
    }
    part_map.iterator
  }
  .collect().foreach(println(_))
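The same per-partition dump can be written more compactly with glom(), which collects each partition into an array. A minimal sketch, reusing the data array above:

// glom() turns every partition into an Array, so zipping with the partition
// index shows which elements ended up in which partition.
spark.sparkContext.parallelize(data, 1)
  .glom()
  .zipWithIndex()
  .map { case (elems, idx) => ("part_" + idx, elems.toList) }
  .collect()
  .foreach(println(_))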
003 -- Partitioning by key value range, which can easily cause data skew
val data = Array(1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11, 22, 33, 43, 26, 50, 81, 54, 76, 94, 100)
val s: RDD[(Int, Int)] = spark.sparkContext.parallelize(data, 1).map(str => (str, 1))
val mx = s.sortBy(_._1, false).first()._1
s.partitionBy(new SortPartitoner(4, mx))
  .mapPartitionsWithIndex { (partionId, iter) =>
    val part_name = "part_" + partionId
    val part_map = scala.collection.mutable.Map[String, List[Int]]()
    part_map(part_name) = List[Int]()
    while (iter.hasNext) {
      part_map(part_name) :+= iter.next()._1         // :+= appends the element to the tail of the list
    }
    part_map.iterator
  }
  .collect().foreach(println(_))
// Custom partitioner: the same SortPartitoner defined in section 1.
## Result
(part_0,List(1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11, 22))
(part_1,List(33, 43, 26, 50))
(part_2,List(54, 76))
(part_3,List(81, 94, 100))
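Note that with partitionBy alone the elements inside a partition keep their original order (see part_1 above). To get the global sort promised in the section title, sort each partition locally and rely on the fact that SortPartitoner assigns strictly increasing key ranges to increasing partition ids, so concatenating the partitions in partition order is the final merge step. A minimal sketch, reusing s, mx and SortPartitoner from above:

// Sort inside each partition, then collect(): partition 0 holds the smallest
// key range, partition 1 the next one, and so on, so the concatenation that
// collect() returns (in partition order) is already globally sorted.
val globallySorted: Array[Int] = s
  .partitionBy(new SortPartitoner(4, mx))
  .mapPartitions(iter => iter.map(_._1).toList.sorted.iterator, preservesPartitioning = true)
  .collect()
println(globallySorted.mkString(", "))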
3. Several ways to implement mapPartitions
/**
 * Using mapPartitions
 */
def test_mapPartition(): Unit = {
  val sc = spark.sparkContext
  val a: RDD[Int] = sc.parallelize(1 to 1000000, 2)

  // Baseline: a plain map over a repartitioned RDD
  val startTime = System.nanoTime()
  println(a.repartition(4).map(str => str * 3).sum())
  val endTime = System.nanoTime()
  println((endTime - startTime) / 1000000000d)

  // 01 -- First approach: build a List inside the partition, buffering all of its elements in memory
  def terFunc(iter: Iterator[Int]): Iterator[Int] = {
    var res = List[Int]()
    while (iter.hasNext) {
      val cur = iter.next()
      res = (cur * 3) :: res                         // prepend to the list (order does not matter for the sum)
    }
    res.iterator
  }
  val startTime2 = System.nanoTime()
  val result = a.mapPartitions(terFunc).sum()
  val endTime2 = System.nanoTime()
  println(result + "==" + (endTime2 - startTime2) / 1000000000d)

  // 02 -- Second approach: a custom iterator that transforms elements lazily, one at a time
  val startTime3 = System.nanoTime()
  val result2 = a.mapPartitions(v => new CustomIterator(v)).sum()
  val endTime3 = System.nanoTime()
  println(result2 + "==" + (endTime3 - startTime3) / 1000000000d)
}
// 03 -- A for/yield traversal is also a valid mapPartitions function
def mapPartitionsTest(listParam: Iterator[Int]): Iterator[Int] = {
  println("by partition:")
  val res = for (param <- listParam) yield param * 2
  res
}
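mapPartitionsTest is only defined above, not invoked. A minimal usage sketch, assuming an RDD[Int] named a like the one built in test_mapPartition:

// "by partition:" is printed once per partition (on the executors),
// and every element is doubled lazily by the for/yield iterator.
val doubledSum = a.mapPartitions(mapPartitionsTest).sum()
println(doubledSum)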
// Custom iterator: wraps the partition's iterator and transforms elements lazily,
// so the whole partition is never materialized in memory
class CustomIterator(iter: Iterator[Int]) extends Iterator[Int] {
  override def hasNext: Boolean = {
    iter.hasNext
  }
  override def next(): Int = {
    val cur = iter.next()
    cur * 3                                          // return the transformed element
  }
}
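For a simple element-wise transformation like this one, the custom iterator class is usually unnecessary: mapping the input iterator in place gives the same streaming behaviour. A minimal sketch, again assuming the RDD a from test_mapPartition:

// Iterator.map is lazy, so this neither buffers the partition (approach 01)
// nor needs a hand-written Iterator subclass (approach 02).
val result3 = a.mapPartitions(iter => iter.map(_ * 3)).sum()
println(result3)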