A review of partition operations in Spark -- mapPartitions

1. repartitionAndSortWithinPartitions in Spark: repartition + sort within each partition

import org.apache.spark.rdd.RDD

def spark_rand(): Unit = {
  val data = Array(1, 2, 3, 4, 5, 6, 6, 7, 3, 4, 8, 9, 10, 11, 22, 33, 43, 26, 50, 81, 54, 76, 94, 100)
  // `spark` is an existing SparkSession
  val s: RDD[(Int, Int)] = spark.sparkContext.parallelize(data, 1).map(str => (str, 1))
  val mx = s.sortBy(_._1, false).first()._1 // largest key, used to size the buckets
  s.repartitionAndSortWithinPartitions(new SortPartitioner(4, mx))
    .mapPartitionsWithIndex { (partitionId, iter) =>
      val part_name = "part_" + partitionId
      val part_map = scala.collection.mutable.Map[String, List[Int]]()
      part_map(part_name) = List[Int]()
      while (iter.hasNext) {
        part_map(part_name) :+= iter.next()._1 // :+= appends the element to the end of the list
      }
      part_map.iterator
    }
    .collect().foreach(println(_))
}

import org.apache.spark.Partitioner

// Custom partitioner: splits the key range [0, max] into `num` equally sized buckets
class SortPartitioner(num: Int, max: Int) extends Partitioner {
  override def numPartitions: Int = num
  val partitionerSize = max / num + 1 // width of each bucket
  override def getPartition(key: Any): Int = {
    val intKey = key.asInstanceOf[Int]
    intKey / partitionerSize // bucket index grows with the key value
  }
}
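
To see how these buckets line up with the output below, here is a quick check of the partitioner's arithmetic (assuming max = 100 and num = 4, as in the example above): partitionerSize = 100 / 4 + 1 = 26, so keys 0-25 land in partition 0, 26-51 in partition 1, 52-77 in partition 2 and 78-103 in partition 3.

val p = new SortPartitioner(4, 100)
Seq(22, 26, 50, 54, 76, 81, 100).foreach { k =>
  println(s"key $k -> partition ${p.getPartition(k)}") // 22 -> 0; 26, 50 -> 1; 54, 76 -> 2; 81, 100 -> 3
}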

## Output
(part_0,List(1, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9, 10, 11, 22))
(part_1,List(26, 33, 43, 50))
(part_2,List(54, 76))
(part_3,List(81, 94, 100))
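
The same layout could be produced with partitionBy followed by an explicit sort of each partition, but repartitionAndSortWithinPartitions is preferable because it pushes the sorting down into the shuffle machinery. A minimal sketch of the less efficient variant, reusing s, mx and SortPartitioner from above:

s.partitionBy(new SortPartitioner(4, mx))
  .mapPartitions(iter => iter.toArray.sortBy(_._1).iterator, preservesPartitioning = true)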

Note:
Calling sortBy after this custom repartitioning shuffles the data again; internally it amounts to another pass through a RangePartitioner, so the custom partition layout is discarded.

After calling sortBy:
(part_0,List(1, 2, 3, 3, 4, 4))
(part_1,List(5, 6, 6, 7, 8, 9))
(part_2,List(10, 11, 22, 26, 33, 43))
(part_3,List(50, 54, 76, 81, 94, 100))
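
A minimal way to confirm this, assuming the same s as above: sortBy delegates to sortByKey, and the shuffled RDD it returns carries a RangePartitioner.

val sorted = s.sortByKey(ascending = true, numPartitions = 4)
println(sorted.partitioner) // Some(org.apache.spark.RangePartitioner@...)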

2. Custom partitioning: sorted partitions + a second merge-sort pass for a global order

002 --- print the data grouped by partition
val data = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 22, 33, 43, 26, 50, 81, 54, 76, 94, 100)
spark.sparkContext.parallelize(data, 1)
  .mapPartitionsWithIndex { (partitionId, iter) =>
    val part_name = "part_" + partitionId
    val part_map = scala.collection.mutable.Map[String, List[Int]]()
    part_map(part_name) = List[Int]()
    while (iter.hasNext) {
      part_map(part_name) :+= iter.next() // :+= appends the element to the end of the list
    }
    part_map.iterator
  }
  .collect().foreach(println(_))
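
Run as written, this prints a single (part_0, List(...)) entry because numSlices is 1. A small sketch (assuming 4 slices) of the positional split that parallelize applies:

spark.sparkContext.parallelize(data, 4)
  .mapPartitionsWithIndex((id, iter) => Iterator(("part_" + id, iter.toList)))
  .collect().foreach(println(_))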

003 -- partition by key value, which easily causes data skew
val data = Array(1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11, 22, 33, 43, 26, 50, 81, 54, 76, 94, 100)
val s: RDD[(Int, Int)] = spark.sparkContext.parallelize(data, 1).map(str => (str, 1))
val mx = s.sortBy(_._1, false).first()._1
s.partitionBy(new SortPartitioner(4, mx))
  .mapPartitionsWithIndex { (partitionId, iter) =>
    val part_name = "part_" + partitionId
    val part_map = scala.collection.mutable.Map[String, List[Int]]()
    part_map(part_name) = List[Int]()
    while (iter.hasNext) {
      part_map(part_name) :+= iter.next()._1 // :+= appends the element to the end of the list
    }
    part_map.iterator
  }
  .collect().foreach(println(_))
// SortPartitioner is the same custom partitioner defined in section 1

## Output
(part_0,List(1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 11, 22))
(part_1,List(33, 43, 26, 50))
(part_2,List(54, 76))
(part_3,List(81, 94, 100))
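
Partition 0 ends up with 13 of the 22 values while partition 2 holds only 2, which is exactly the skew mentioned in the 003 heading. As for the global sort promised by the section title: each list produced by repartitionAndSortWithinPartitions is already sorted, so a second, driver-side merge of the collected lists yields a total order. A minimal sketch, reusing s, mx and SortPartitioner from snippet 003 (mergeSorted is a hypothetical helper, not part of the original code):

// Merge already-sorted lists into one sorted list by folding two-way merges.
def mergeSorted(parts: Seq[List[Int]]): List[Int] =
  parts.foldLeft(List.empty[Int]) { (acc, part) =>
    val merged = scala.collection.mutable.ListBuffer[Int]()
    var (a, b) = (acc, part)
    while (a.nonEmpty && b.nonEmpty) {
      if (a.head <= b.head) { merged += a.head; a = a.tail }
      else { merged += b.head; b = b.tail }
    }
    (merged ++= a ++= b).toList
  }

// Collect the sorted per-partition lists, then merge them on the driver.
val partLists: Seq[List[Int]] = s
  .repartitionAndSortWithinPartitions(new SortPartitioner(4, mx))
  .mapPartitions(iter => Iterator(iter.map(_._1).toList))
  .collect().toSeq
println(mergeSorted(partLists))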

3. Several ways to write a mapPartitions pass

/**
  * Using mapPartitions
  */
def test_mapPartition(): Unit = {
  val sc = spark.sparkContext
  val a: RDD[Int] = sc.parallelize(1 to 1000000, 2)
  val startTime = System.nanoTime()
  println(a.repartition(4).map(str => str * 3).sum())
  val endTime = System.nanoTime()
  println((endTime - startTime) / 1000000000d)
  // 01 - first approach: build an intermediate list that buffers every element of the partition
  def terFunc(iter: Iterator[Int]): Iterator[Int] = {
    var res = List[Int]()
    while (iter.hasNext) {
      val cur = iter.next()
      res = cur * 3 :: res // prepend; order is reversed, which does not matter for sum()
    }
    res.iterator
  }
  val startTime2 = System.nanoTime()
  val result = a.mapPartitions(terFunc).sum()
  val endTime2 = System.nanoTime()
  println(result + "==" + (endTime2 - startTime2) / 1000000000d)
  // 02 - second approach: custom iterator, no intermediate buffering
  val startTime3 = System.nanoTime()
  val result2 = a.mapPartitions(v => new CustomIterator(v)).sum()
  val endTime3 = System.nanoTime()
  println(result2 + "==" + (endTime3 - startTime3) / 1000000000d)
}
// 03 - another way: traverse the partition with a for/yield, which stays lazy on the iterator
def mapPartitionsTest(listParam: Iterator[Int]): Iterator[Int] = {
  println("by partition:")
  val res = for (param <- listParam) yield param * 2
  res
}
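
A possible way to call it, assuming the RDD a from test_mapPartition above (note the multiplier here is 2, not 3):

val result3 = a.mapPartitions(mapPartitionsTest).sum()
println(result3) // "by partition:" is printed once per partition, on the executor side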


// Custom iterator: wraps the partition iterator and transforms elements lazily,
// so no intermediate collection is built
class CustomIterator(iter: Iterator[Int]) extends Iterator[Int] {
  override def hasNext: Boolean = {
    iter.hasNext
  }
  override def next(): Int = {
    val cur = iter.next()
    cur * 3 // return value
  }
}
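
Since Iterator.map in the Scala standard library is itself lazy, the custom iterator class can usually be replaced by a one-liner with the same streaming behaviour (no per-partition buffering); a minimal sketch, again assuming the RDD a from above:

val result4 = a.mapPartitions(iter => iter.map(_ * 3)).sum()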

 
