Spark Basics 02: Comparing the Basic Operators

Latest recommended article published on 2023-02-17 17:04:50

嘉平11


Category: Spark

Article link: https://blog.csdn.net/zgm12/article/details/104350408


I. map, mapPartitions, mapPartitionsWithIndex, foreach, foreachPartition

1. map vs. mapPartitions:

map takes a function and applies it to every element.

def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

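mapPartitions takes a function and applies it to each partition as a whole; the function receives an iterator over the partition's elements.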

  def mapPartitions[U: ClassTag](
      f: Iterator[T] => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
      preservesPartitioning)
  }

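Use cases:

When the data volume is moderate, mapPartitions can improve efficiency, because any setup work (opening a connection, building a helper object) is done once per partition instead of once per element.
When the data volume is very large, an OOM can occur, typically because the function materializes a whole partition in memory (for example with toList).

Notes:

1. The function passed to mapPartitions receives a different input from the one passed to map: map's function gets a single element, while mapPartitions' function f gets an Iterator over the partition. P.S.: an Iterator can be traversed only once and lacks some collection methods (reverse, for example); for those operations, convert it with toList first.

2. The function must return an Iterator, so after converting to a collection, append .iterator to the result.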

   val sc = new SparkContext(conf)
    val rdd1: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6,7,8),2)
    println( rdd1.mapPartitions(_.toList.reverse.iterator).collect().toBuffer)



Result:
ArrayBuffer(4, 3, 2, 1, 8, 7, 6, 5)
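
To make the two notes above concrete, here is a minimal sketch in plain Scala (no Spark required) of the kind of function mapPartitions expects. The object and method names are made up for illustration, and the list is only a stand-in for one partition's data.

```scala
object PartitionFunctions {
  // Streams through the partition: Iterator.map already returns an Iterator,
  // so nothing is buffered and no trailing .iterator call is needed.
  def timesTen(it: Iterator[Int]): Iterator[Int] = it.map(_ * 10)

  // Iterator has no reverse method, so the partition is materialized with
  // toList first and converted back to an Iterator at the end.
  def reversed(it: Iterator[Int]): Iterator[Int] = it.toList.reverse.iterator

  def main(args: Array[String]): Unit = {
    val partition0 = List(1, 2, 3, 4) // stand-in for one partition
    println(timesTen(partition0.iterator).toList) // List(10, 20, 30, 40)
    println(reversed(partition0.iterator).toList) // List(4, 3, 2, 1)
  }
}
```

Either function could be passed to rdd.mapPartitions; only the second one needs memory proportional to the partition size.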

2.mapPartitionsWithIndex

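It takes a function f. The first parameter of f, index, is the partition number (starting from 0); the second, an Iterator, holds the elements of that partition.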

 def mapPartitionsWithIndex[U: ClassTag](
      f: (Int, Iterator[T]) => Iterator[U],
      preservesPartitioning: Boolean = false): RDD[U] = withScope {
    val cleanedF = sc.clean(f)
    new MapPartitionsRDD(
      this,
      (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),
      preservesPartitioning)
  }

3. A worked example:

    val conf: SparkConf = new SparkConf()
    conf.setMaster("local[*]")
    conf.setAppName("mymap1")
    val sc = new SparkContext(conf)
    val rdd1: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6,7,8),2)
    val rdd2: RDD[Int] = rdd1.map(_*10)
    val rdd3: RDD[Int] = rdd1.mapPartitions(_.map(_*10))

    // myfunc1 tags every element with the index of the partition that holds it
    val myfunc1 = (index: Int, it: Iterator[Int]) => it.map(x => "[partition: " + index + ", value: " + x + "]")
    val rdd4: RDD[String] = rdd1.mapPartitionsWithIndex(myfunc1)

    // myfunc2 outputs partition index * 10 + element value
    val myfunc2 = (index: Int, it: Iterator[Int]) => it.map(x => index * 10 + x)
    val rdd5: RDD[Int] = rdd1.mapPartitionsWithIndex(myfunc2)

    println(rdd2.partitions.length)  // number of partitions
    println(rdd2.collect().toBuffer)
    println(rdd3.collect().toBuffer)
    println(rdd4.collect().toBuffer)
    println(rdd5.collect().toBuffer)

Output:

2
ArrayBuffer(10, 20, 30, 40, 50, 60, 70, 80)
ArrayBuffer(10, 20, 30, 40, 50, 60, 70, 80)
ArrayBuffer([partition: 0, value: 1], [partition: 0, value: 2], [partition: 0, value: 3], [partition: 0, value: 4], [partition: 1, value: 5], [partition: 1, value: 6], [partition: 1, value: 7], [partition: 1, value: 8])
ArrayBuffer(1, 2, 3, 4, 15, 16, 17, 18)

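4. foreach, foreachPartition

map and mapPartitions are transformations, and both return a new RDD. The difference lies in the function you pass in: map's function returns a single element, while mapPartitions' function returns an Iterator.

foreach and foreachPartition are actions with no return value (Unit). They are used to write results out.

mapPartitions and foreachPartition define the processing logic for one whole partition of an RDD, so variables and objects created inside the function are shared by every record in that partition. map and foreach define the conversion or processing logic for a single record.
In all four operators, the records of one partition are processed sequentially by a single task, and different partitions are processed in parallel. The degree of parallelism depends on the number of CPU cores; with one core everything runs serially.

When you need to connect to a database, foreachPartition is usually the right choice.

foreach and foreachPartition are both actions and run on the executors. foreach writes to the target store one record at a time, so a connection opened inside the function is opened once per record. foreachPartition passes the function an Iterator over all elements of a partition, so one connection serves the whole partition, which is far more efficient. However, if the function buffers an entire partition in memory, a large partition can cause an OOM; one remedy is to increase the number of partitions so that each one is smaller.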
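
The cost difference described above can be sketched without Spark. The model below is only an illustration under stated assumptions: FakeConnection is a hypothetical stand-in that counts how often a connection is opened, and a Seq of Seqs stands in for the partitions of an RDD.

```scala
object ConnectionCost {
  // Counts how many connections have been opened (illustration only).
  var opened = 0

  // Hypothetical stand-in for a database connection.
  final class FakeConnection {
    opened += 1
    def write(record: Int): Unit = ()
    def close(): Unit = ()
  }

  // foreach style: the body runs once per record, so it opens one connection per record.
  def perRecord(partitions: Seq[Seq[Int]]): Int = {
    opened = 0
    partitions.foreach { part =>
      part.foreach { record =>
        val conn = new FakeConnection
        conn.write(record)
        conn.close()
      }
    }
    opened
  }

  // foreachPartition style: the body runs once per partition, so it opens one connection per partition.
  def perPartition(partitions: Seq[Seq[Int]]): Int = {
    opened = 0
    partitions.foreach { part =>
      val conn = new FakeConnection
      try part.foreach(r => conn.write(r)) finally conn.close()
    }
    opened
  }
}
```

With two partitions of four records each, perRecord opens 8 connections and perPartition opens 2.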

// Wrong usage:

val dStream = periodic.foreachRDD(rdd => {
     // val jedis = RedisUtil.getConnectionFromPool   // wrong: a connection created here lives on the driver and would have to be serialized to the executors, so it cannot be used
     val pStream = rdd.mapPartitions(partition => {
          val newPartition = partition.map(jsonobject => {
          // val jedis = RedisUtil.getConnectionFromPool // wrong: opens a connection for every record, which exhausts the connection pool and causes timeout exceptions
           // do some processing
           ...
           jsonobject
       })
       jedis.close() // wrong: map is lazy, so the connection may be closed before it has been used, and an exception (a null pointer exception in the author's case) is thrown when it is finally needed
      // newPartition.toIterator // if this line stays commented out, the function returns no iterator and the whole mapPartitions call is invalid
     }).persist()

     // the wrong places to create a connection in foreachPartition are the same as in mapPartitions
     pStream.foreachPartition(partition => {
       val client = mongoClient()
       val jedis = RedisUtil.getConnectionFromPool
       val pipeline = jedis.pipelined()
       val commandList = new util.ArrayList[WriteModel[Document]]()
       partition.foreach(line => {
         // do some processing
         ...
       })
       // keep the number of database connections low and write to the database in batches
       if(pipeline!=null){
         pipeline.sync()
         pipeline.close()
       }
       jedis.close()
       if (commandList.size() > 0) {
         val result = client.getMongoColl(dbName, devDictionaryCollName).bulkWrite(commandList, new BulkWriteOptions().ordered(false))
         println("*** bulk write result: ***" + result.toString)
       }
     })
     pStream.unpersist()
   })


// Correct usage:
val dStream = periodic.foreachRDD(rdd => {
      val pStream = rdd.mapPartitions(partition => {
        val jedis = RedisUtil.getConnectionFromPool // correct place to create the connection: once per partition, on the executor
        val newPartition = partition.map(jsonobject => {
            // do some processing
            ...
            jsonobject
        }).toList // map is lazy; toList forces every record to be processed before jedis.close() on the next line (note that it holds the whole partition in memory)
        jedis.close()
        newPartition.toIterator
      }).persist() // cache the RDD because several actions are run on it below

      // run the output action on the transformed RDD
      pStream.foreachPartition(partition => {
        val client = mongoClient()
        val jedis = RedisUtil.getConnectionFromPool
        val pipeline = jedis.pipelined()
        val commandList = new util.ArrayList[WriteModel[Document]]()
        partition.foreach(line => {
          // do some processing
          ...
        })
        // keep the number of database connections low and write to the database in batches
        if(pipeline!=null){
          pipeline.sync()
          pipeline.close()
        }
        jedis.close()
        if (commandList.size() > 0) {
          val result = client.getMongoColl(dbName, devDictionaryCollName).bulkWrite(commandList, new BulkWriteOptions().ordered(false))
          println("*** bulk write result: ***" + result.toString)
        }
      })
      pStream.unpersist()
    })
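
The laziness problem that separates the wrong version from the correct one can be reproduced in plain Scala. This is a sketch under stated assumptions: Resource is a hypothetical class that stands in for the Redis connection and simply refuses to work after close().

```scala
object LazyIteratorPitfall {
  // Hypothetical resource that throws once it has been closed.
  final class Resource {
    private var closed = false
    def lookup(x: Int): Int = {
      if (closed) throw new IllegalStateException("resource already closed")
      x * 10
    }
    def close(): Unit = { closed = true }
  }

  // Wrong: Iterator.map is lazy, so lookup only runs when the caller
  // consumes the result, which is after close() has been called.
  def wrong(partition: Iterator[Int]): Iterator[Int] = {
    val res = new Resource
    val out = partition.map(x => res.lookup(x))
    res.close()
    out
  }

  // Right: toList forces every lookup before the resource is closed,
  // at the price of holding the whole partition in memory.
  def right(partition: Iterator[Int]): Iterator[Int] = {
    val res = new Resource
    val out = partition.map(x => res.lookup(x)).toList
    res.close()
    out.iterator
  }
}
```

Consuming the iterator returned by wrong throws the IllegalStateException, while right returns 10, 20, 30 for the input 1, 2, 3.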

II. groupByKey, reduceByKey, aggregate, aggregateByKey, combineByKey

1.groupByKey

  def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(defaultPartitioner(self))
  }

The overload groupByKey(defaultPartitioner(self)) called here is implemented as follows:


 def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {
    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }

2.reduceByKey

  def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    reduceByKey(defaultPartitioner(self), func)
  }


  // The overload reduceByKey(defaultPartitioner(self), func) called here is implemented as follows:

  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

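Compared with groupByKey, reduceByKey adds an optimization: values with the same key are first reduced within each partition, and only then reduced across partitions. The two also take different arguments, and the arguments determine the return type: with the user-defined function func, reduceByKey lets you decide how the values of a key are combined.

Similarities:
1. Both operate on RDD[(K, V)].
2. Both group or aggregate by key.
3. By default the number of partitions normally stays the same (it is chosen by defaultPartitioner), but both accept an argument that sets it.

Differences:
1. groupByKey takes no aggregation function; its return type is RDD[(K, Iterable[V])].
2. reduceByKey requires an aggregation function; its return type is RDD[(K, V)], where V is the aggregated value.
3. groupByKey() followed by a map over the grouped values gives the same result as reduceByKey; for example, groupByKey().mapValues(_.reduce(func)) is equivalent to reduceByKey(func).

The most important difference:
reduceByKey aggregates within each partition first, and then aggregates the data for each key globally.
groupByKey does no local aggregation; it only groups.

Conclusion:
If either operator would work, prefer reduceByKey.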
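
Why the local aggregation matters can be shown with a small plain-Scala model. This is not Spark code; it only counts, under the simplifying assumption that every record leaving a partition is shuffled, how many records each strategy sends. All names are made up for illustration.

```scala
object ShuffleVolume {
  type Partition = Seq[(String, Int)]

  // groupByKey-like: every record crosses the shuffle unchanged.
  def shuffledWithoutCombine(parts: Seq[Partition]): Int =
    parts.map(_.size).sum

  // Reduces the records of one partition per key (the map-side combine).
  def localReduce(part: Partition, f: (Int, Int) => Int): Map[String, Int] =
    part.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // reduceByKey-like: at most one record per key and partition is shuffled.
  def shuffledWithCombine(parts: Seq[Partition], f: (Int, Int) => Int): Int =
    parts.map(p => localReduce(p, f).size).sum

  // Final result: merge the locally reduced records across partitions.
  def reduceByKey(parts: Seq[Partition], f: (Int, Int) => Int): Map[String, Int] =
    parts.flatMap(p => localReduce(p, f).toSeq)
      .groupBy(_._1)
      .map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }
}
```

For two partitions holding ("a",1), ("a",1), ("b",1) and ("a",1), ("b",1), ("b",1), the model shuffles 6 records without the combine and 4 with it, and both routes end at a -> 3, b -> 3.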

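3. aggregate

aggregate[U](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

Worked example:

val rdd1: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6),2)
val r2: Int = rdd1.aggregate(5)(math.max(_,_),_+_)

In partition 0: 5 is compared with 1, giving 5; that 5 is compared with 2, giving 5; that 5 is compared with 3, giving 5.

In partition 1: 5 is compared with 4, giving 5; that 5 is compared with 5, giving 5; that 5 is compared with 6, giving 6.

When the second function runs, the zero value (5 here) goes on the left first: 5 + the result of one partition + the result of the other partition = 16.

Important: the zero value is used not only by the first function but also by the second.

When the second function combines the partitions, the order in which the partition results arrive is not guaranteed; either one may come first.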
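
The evaluation just described can be written as a short plain-Scala model. It is a sketch of the semantics, not Spark's implementation: a Seq of Seqs stands in for the partitions, and the partition results are combined in a fixed order, whereas Spark combines them in whatever order the tasks finish.

```scala
object AggregateModel {
  // seqOp folds each partition starting from zero; combOp then folds the
  // partition results, again starting from zero.
  def aggregate[T, U](parts: Seq[Seq[T]], zero: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U = {
    val perPartition = parts.map(_.foldLeft(zero)(seqOp))
    perPartition.foldLeft(zero)(combOp)
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(Seq(1, 2, 3), Seq(4, 5, 6))
    println(aggregate(parts, 0)(math.max(_, _), _ + _)) // 0 + 3 + 6 = 9
    println(aggregate(parts, 5)(math.max(_, _), _ + _)) // 5 + 5 + 6 = 16
  }
}
```

Because the zero value enters once per partition and once more in the final combine, a zero value that is not neutral for both functions changes the result with the number of partitions.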

    val conf: SparkConf = new SparkConf()
    conf.setMaster("local[*]")
    conf.setAppName("myaggregate")
    val sc = new SparkContext(conf)
    val rdd1: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6),2)
    def func1(index: Int, iter: Iterator[(Int)]) : Iterator[String] = {
      iter.toList.map(x => "[partID:" +  index + ", val: " + x + "]").iterator
    }
    println(rdd1.mapPartitionsWithIndex(func1).collect().toBuffer)
    val r1: Int = rdd1.aggregate(0)(math.max(_,_),_+_)
    val r2: Int = rdd1.aggregate(5)(math.max(_,_),_+_)
    println("r1:"+r1)        
    println("r2:"+r2)


    val rdd2 = sc.parallelize(List("a","b","c","d","e","f"),2)
    def func2(index: Int, iter: Iterator[(String)]) : Iterator[String] = {
      iter.toList.map(x => "[partID:" +  index + ", val: " + x + "]").iterator
    }
   println( rdd2.mapPartitionsWithIndex(func2).collect.toBuffer)
    val r3: String = rdd2.aggregate("")(_ + _, _ + _)
    val r4: String = rdd2.aggregate("=")(_ + _, _ + _)
    println("r3:"+r3)
    println("r4:"+r4)

Output:

ArrayBuffer([partID:0, val: 1], [partID:0, val: 2], [partID:0, val: 3], [partID:1, val: 4], [partID:1, val: 5], [partID:1, val: 6])
r1:9
r2:16
ArrayBuffer([partID:0, val: a], [partID:0, val: b], [partID:0, val: c], [partID:1, val: d], [partID:1, val: e], [partID:1, val: f])
r3:abcdef        // or defabc
r4:==abc=def     // or ==def=abc; the order is not fixed, both are possible

Exercise: implement the following requirement.

Requirement: compute the count and the sum of the input data
Data: val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9), 3)
Result: (9, 45)

 val a = sc.parallelize(Array(1,2,3,4,5,6,7,8,9), 3)
     val tuple: (Int, Int) = a.aggregate((0,0))((x, y)=>(x._1+1,x._2+y), (i, j)=>(i._1+j._1,i._2+j._2))
     println(tuple)

4. aggregateByKey

aggregateByKey[U](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]

The second parameter is the partitioner. It can be omitted, in which case the default partitioner is used.

 def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
  }


The overload aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp) called here
is implemented as follows:
  def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
      combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
    // Serialize the zero value to a byte array so that we can get a new clone of it on each key
    val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
    val zeroArray = new Array[Byte](zeroBuffer.limit)
    zeroBuffer.get(zeroArray)

    lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
    val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))

    // We will clean the combiner closure later in `combineByKey`
    val cleanedSeqOp = self.context.clean(seqOp)
    combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
      cleanedSeqOp, combOp, partitioner)
  }

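Notes:

1. The zero value is used only in the within-partition computation, not by the function that combines partitions.

Within a partition, the first time a key is encountered, the zero value is passed as the first argument of the within-partition function.

2. aggregateByKey operates on (k, v) data. The within-partition function (seqOp) and the cross-partition function (combOp) both work on values only: seqOp merges a v from (k, v) into the accumulator, combOp merges two accumulators, and neither sees the key.

Example: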


    val b = sc.parallelize(List(("mouse", 2),("cat",200), ("cat", 5),
                                 ("mouse", 4),("cat", 12), ("dog", 12)), 2)
    def func3(index: Int, iter: Iterator[(String, Int)]) : Iterator[String] = {
      iter.toList.map(x => "[partID:" +  index + ", val: " + x + "]").iterator
    }
   println(b.mapPartitionsWithIndex(func3).collect.toBuffer)
    val r: RDD[(String, Int)] = b.aggregateByKey(100)(math.max(_,_),_+_)
     println(r.collect().toBuffer)

Output:

ArrayBuffer([partID:0, val: (mouse,2)], [partID:0, val: (cat,200)], [partID:0, val: (cat,5)], [partID:1, val: (mouse,4)], [partID:1, val: (cat,12)], [partID:1, val: (dog,12)])
ArrayBuffer((dog,100), (cat,300), (mouse,200))

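Step-by-step analysis:

ArrayBuffer(
[partID:0, val: (mouse,2)],[partID:0, val: (cat,200)], [partID:0, val: (cat,5)],

 [partID:1, val: (mouse,4)], [partID:1, val: (cat,12)],[partID:1, val: (dog,12)])

val r: RDD[(String, Int)] =
b.aggregateByKey(100)(math.max(_,_),_+_)

------------------
Within each partition (seqOp = max, each key starts from the zero value 100):
Partition 0:
mouse  max(100, 2)   = 100
cat    max(100, 200) = 200
cat    max(200, 5)   = 200

Partition 1:
mouse  max(100, 4)   = 100
cat    max(100, 12)  = 100
dog    max(100, 12)  = 100

----------------
Across partitions (combOp = +, the zero value is not used):
mouse  100 + 100 = 200
cat    200 + 100 = 300
dog    100 (present only in partition 1, so there is nothing to add)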
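
The same trace can be reproduced with a plain-Scala model of aggregateByKey. As before, this is a sketch of the semantics with made-up names, not Spark's implementation, and a Seq of Seqs stands in for the partitions.

```scala
object AggregateByKeyModel {
  def aggregateByKey[K, V, U](parts: Seq[Seq[(K, V)]], zero: U)(
      seqOp: (U, V) => U, combOp: (U, U) => U): Map[K, U] = {
    // Within each partition: every key starts from zero and folds in its values.
    val perPartition: Seq[Map[K, U]] = parts.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zero), v))
      }
    }
    // Across partitions: combOp merges the results for the same key. The zero
    // value is not used, and a key seen in one partition only passes through.
    perPartition.flatMap(_.toSeq).groupBy(_._1).map { case (k, kus) =>
      k -> kus.map(_._2).reduce(combOp)
    }
  }

  def main(args: Array[String]): Unit = {
    val parts = Seq(
      Seq(("mouse", 2), ("cat", 200), ("cat", 5)),
      Seq(("mouse", 4), ("cat", 12), ("dog", 12)))
    // Contains dog -> 100, cat -> 300, mouse -> 200 (map order may vary).
    println(aggregateByKey(parts, 100)(math.max(_, _), _ + _))
  }
}
```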

5.combineByKey

combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]

  def combineByKey[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): RDD[(K, C)] = self.withScope {
    combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners)(null)
  }

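Like aggregateByKey, combineByKey can return values whose type differs from the type of the input values.

The combineByKey parameters:

(1) createCombiner:
combineByKey() walks through all elements of a partition, so the key of each element is either new or the same as the key of an earlier element. If the key is new, combineByKey() calls the function createCombiner() to create the initial value of the accumulator for that key.

(2) mergeValue:
If the key has already been seen earlier in the current partition, mergeValue() merges the new value into the current value of that key's accumulator.

(3) mergeCombiners:
Because each partition is processed independently, the same key can have several accumulators. If two or more partitions hold an accumulator for the same key, the user-supplied mergeCombiners() merges the per-partition results.

Unlike aggregateByKey, the initial value becomes an initial function.

As with aggregateByKey, the initial function is applied only within partitions.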
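
The interplay of the three functions can be sketched as a plain-Scala model. It mirrors the description above rather than Spark's actual implementation; the object name and the Seq-of-Seqs stand-in for partitions are assumptions made for illustration.

```scala
object CombineByKeyModel {
  def combineByKey[K, V, C](parts: Seq[Seq[(K, V)]])(
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    val perPartition: Seq[Map[K, C]] = parts.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          // First time this key appears in the partition: build its accumulator.
          case None => acc.updated(k, createCombiner(v))
          // Key already seen in this partition: fold the value into the accumulator.
          case Some(c) => acc.updated(k, mergeValue(c, v))
        }
      }
    }
    // Merge the accumulators that different partitions built for the same key.
    perPartition.flatMap(_.toSeq).groupBy(_._1).map { case (k, kcs) =>
      k -> kcs.map(_._2).reduce(mergeCombiners)
    }
  }
}
```

For example, combineByKey[String, Int, Int](Seq(Seq("hello" -> 1, "hello" -> 1, "tom" -> 1), Seq("hello" -> 1)))(x => x + 10, _ + _, _ + _) gives hello -> 23 and tom -> 11: the +10 of the initial function is applied once per key and partition (12 in the first partition, 11 in the second for hello).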

Example:

   val rdd5 = sc.textFile("hdfs://mini1:9000/data/wcount").flatMap(_.split(" ")).map((_,1))
    def func4(index:Int,iter:Iterator[(String,Int)]):Iterator[String] ={
      iter.toList.map(x=>"[partitionID:"+index+"value:"+x+"]").iterator
    }
    println("number of partitions in rdd5: "+rdd5.partitions.length)
     println(rdd5.mapPartitionsWithIndex(func4).collect().toBuffer)
    val rdd6: RDD[(String, Int)] = rdd5.combineByKey(x=>x+10, (a:Int, b:Int)=>a+b, (i:Int, j:Int)=>i+j)
    println(rdd6.collect().toBuffer)


ArrayBuffer([partitionID:0value:(hello,1)], 
[partitionID:0value:(china,1)], [partitionID:0value:(hello,1)], 
[partitionID:0value:(jack,1)], [partitionID:0value:(hello,1)],
 [partitionID:0value:(jim,1)], [partitionID:0value:(hello,1)], 
[partitionID:0value:(tom,1)], 
[partitionID:1value:(hello,1)], [partitionID:1value:(hadoop,1)], 
[partitionID:1value:(hello,1)],
 [partitionID:1value:(china,1)], [partitionID:1value:(nihao,1)],
 [partitionID:1value:(zhongguo,1)], [partitionID:1value:(mingbai,1)],
 [partitionID:1value:(lala,1)], [partitionID:1value:(weiwei,1)], 
[partitionID:1value:(xingfen,1)], [partitionID:1value:(xiaoguo,1)], 
[partitionID:2value:(mimi,1)], [partitionID:2value:(lala,1)],
 [partitionID:2value:(mingbai,1)])

ArrayBuffer((xiaoguo,11), (tom,11), (weiwei,11), (hadoop,11),
 (hello,26), (jack,11), (jim,11), (mingbai,22),
 (lala,22), (zhongguo,11), (mimi,11), (china,22), (xingfen,11), (nihao,11))

Example 2:

Separating the "singles" from the "non-singles" (the numeric key marks the group)

    val rdd4 = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
    val rdd5 = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
    val rdd6 = rdd5.zip(rdd4)
    val rdd7 = rdd6.combineByKey(List(_), (x: List[String], y: String)
    => x :+ y, (m: List[String], n: List[String]) => m ++ n)
   println(rdd7.collect().toBuffer)


Output:
ArrayBuffer((1,List(dog, cat, turkey)), (2,List(gnu, salmon, rabbit, wolf, bear, bee)))

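III. coalesce, repartition, repartitionAndSortWithinPartitions, partitionBy

The two repartitioning functions coalesce and repartition change the number of partitions, and with it the parallelism.

coalesce does not shuffle by default, which means it cannot increase the number of partitions. To increase it, pass true as the second argument (shuffle) to allow a shuffle.

repartition calls coalesce with the second argument set to true, so it can both increase and decrease the number of partitions.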

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

  val conf: SparkConf = new SparkConf()
  conf.setMaster("local[*]")
  conf.setAppName("myaggregate")
  val sc = new SparkContext(conf)

    val rdd1: RDD[Int] = sc.parallelize(List(1,2,3,4,5,6,7,8,9),3)
    println(rdd1.partitions.length)
    val rdd2: RDD[Int] = rdd1.coalesce(2)
    println(rdd1.partitions.length)
    println(rdd2.partitions.length)

    val rdd3: RDD[Int] = rdd1.coalesce(4)
    println(rdd3.partitions.length)   // not changed: still 3, because no shuffle is allowed
    val rdd33: RDD[Int] = rdd1.coalesce(4,true)
    println(rdd33.partitions.length)   // changed: now 4

    val rdd4 = rdd2.repartition(2)
    println(rdd4.partitions.length)
    val rdd5 = rdd2.repartition(4)
    println(rdd5.partitions.length)      
Output:
3
3
2
3
4
2
4

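3. repartitionAndSortWithinPartitions

It takes a partitioner, repartitions the data with it, and sorts the records inside each resulting partition. The target partition is chosen from the key (from its hash when a HashPartitioner is used) and the sort is by key only; the size of the values has no effect on the result.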

  val rdd6: RDD[(String, Int)] = sc.parallelize(List(("apple",8),("caption",9),
("dog",4),("cat",34),("fish",5),("apen",1),("drink",23),("car",6)),3)
    rdd6.repartitionAndSortWithinPartitions(new HashPartitioner(2)).foreach(println)

Output:
(apen,1)
(apple,8)
(caption,9)
(car,6)
(cat,34)
(dog,4)
(drink,23)
(fish,5)
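
What the operator does to this data can be approximated in plain Scala. The sketch below is a model with made-up names, not Spark code: partitionOf follows the rule used by Spark's HashPartitioner (key.hashCode modulo the partition count, made non-negative), and each partition is then sorted by key.

```scala
object RepartitionAndSortModel {
  // HashPartitioner rule: non-negative key.hashCode mod numPartitions.
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Routes every record to its partition, then sorts each partition by key.
  // The values play no role in either step.
  def repartitionAndSort[V](data: Seq[(String, V)], numPartitions: Int): Seq[Seq[(String, V)]] =
    (0 until numPartitions).map { p =>
      data.filter { case (k, _) => partitionOf(k, numPartitions) == p }
        .sortBy(_._1)
    }

  def main(args: Array[String]): Unit = {
    val data = Seq(("apple", 8), ("caption", 9), ("dog", 4), ("cat", 34),
      ("fish", 5), ("apen", 1), ("drink", 23), ("car", 6))
    repartitionAndSort(data, 2).zipWithIndex.foreach { case (part, i) =>
      println("partition " + i + ": " + part.mkString(", "))
    }
  }
}
```

Keys are only ordered inside a partition; there is no ordering guarantee across partitions.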

4.partitionBy

partitionBy accepts a partitioner, including a custom one that you define yourself.
