SparkCore RDD (Part 2)

冬瓜螺旋雪碧

Category: Spark. Tags: spark, rdd, sparkcore

Article link: https://blog.csdn.net/kzw11/article/details/101352732

A brief look at RDDs alongside the Spark Core source code (all examples were run in spark-shell; the Spark version is 2.4.4)

1 map and mapPartitions

     /**
      * map processes the data one element at a time
      * mapPartitions processes the data one partition at a time
      *
      */
scala>  val rdd = sc.parallelize(List(1,2,3,4,5))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.map(_*2).collect
res0: Array[Int] = Array(2, 4, 6, 8, 10)

scala> rdd.mapPartitions(partition => partition.map(_*2)).collect
res1: Array[Int] = Array(2, 4, 6, 8, 10)

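Summary:
Suppose the RDD has 100 elements in 10 partitions.
The function passed to map is called 100 times, once per element, while the function passed to mapPartitions is called only 10 times, once per partition, and each call receives that partition's iterator.
It follows that when writing the data of an RDD to MySQL, a per-partition operator such as mapPartitions is the better choice (foreachPartition if nothing needs to be returned): a connection is obtained once per partition instead of once per element, so the resource overhead is much lower.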
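
The difference in setup cost can be modelled without a cluster. The following is a minimal plain-Scala sketch, not Spark code: the partitions are ordinary lists, and `FakeConnection` is a hypothetical stand-in for a database connection that only counts how often it is opened.

```scala
object ConnectionCountSketch {
  // Hypothetical stand-in for a database connection: it only counts how often it is opened.
  final class FakeConnection(counter: Array[Int]) {
    counter(0) += 1
    def write(x: Int): Int = x * 2
  }

  // 100 elements split into 10 "partitions" of 10 elements each.
  val partitions: List[List[Int]] = (1 to 100).grouped(10).map(_.toList).toList

  // map-style: the function runs once per element, so a connection is opened per element.
  def perElement(): (List[Int], Int) = {
    val opened = Array(0)
    val out = partitions.flatMap(p => p.map(x => new FakeConnection(opened).write(x)))
    (out, opened(0))
  }

  // mapPartitions-style: the function runs once per partition,
  // so a single connection serves the whole iterator of that partition.
  def perPartition(): (List[Int], Int) = {
    val opened = Array(0)
    val out = partitions.flatMap { p =>
      val conn = new FakeConnection(opened)
      p.iterator.map(x => conn.write(x)).toList
    }
    (out, opened(0))
  }
}
```

Both variants produce the same output; only the number of opened connections differs (one per element versus one per partition).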

2 mapPartitionsWithIndex and mapValues

// Produce each element together with the index of its partition
scala> rdd.mapPartitionsWithIndex((index, partition) =>{
     |       partition.map(x => s"partition $index, element $x")
     | }).collect
res2: Array[String] = Array(partition 0, element 1, partition 0, element 2, partition 1, element 3, partition 1, element 4, partition 1, element 5)

 // mapValues applies the function only to the V of an RDD[(K, V)]; the keys are left untouched
scala> sc.parallelize(List(("aa",30),("bb",18))).mapValues(_ + 1).collect
res4: Array[(String, Int)] = Array((aa,31), (bb,19))

3 map and flatMap

// map keeps the structure of the RDD: every input element yields exactly one output element
scala> sc.parallelize(List(List(1,2),List(3,4))).map(x=>x.map(_ * 2)).collect
res6: Array[List[Int]] = Array(List(2, 4), List(6, 8))

// flatMap = map + flatten: apply the function to each element, then flatten the results
scala> sc.parallelize(List(List(1,2),List(3,4))).flatMap(x=>x.map(_*2)).collect
res7: Array[Int] = Array(2, 4, 6, 8)

4 sample: source code analysis

scala>  sc.parallelize(1 to 30).sample(true, 0.4, 2).collect
res11: Array[Int] = Array(1, 2, 2, 6, 6, 10, 11, 11, 15, 24, 28, 29, 30)

/**
 * Return a sampled subset of this RDD.
 * @note This is NOT guaranteed to provide exactly the fraction of the count of the given [[RDD]].
 */
def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = {
  require(fraction >= 0,
    s"Fraction must be nonnegative, but got ${fraction}")

  withScope {
    require(fraction >= 0.0, "Negative fraction value: " + fraction)
    if (withReplacement) {
      new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
    } else {
      new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
    }
  }
}

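Parameters:
1. withReplacement: whether an element can be sampled multiple times (it is put back after being drawn)

2. fraction: the expected size of the sample as a fraction of this RDD's size.
When withReplacement = false: the probability that each element is chosen; fraction must be in [0, 1].
When withReplacement = true: the expected number of times each element is chosen; fraction must be greater than or equal to 0.

3. seed: the seed for the random number generator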
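
To make the two meanings of fraction concrete, here is a small plain-Scala model of the two sampling modes. It is only a sketch under stated assumptions: it works on a local Seq, uses scala.util.Random and a textbook Poisson generator, and does not reproduce Spark's BernoulliSampler or PoissonSampler, so its output will not match spark-shell results for the same seed. The names `SamplingSketch`, `bernoulli` and `withReplacement` are made up for this illustration.

```scala
import scala.util.Random

object SamplingSketch {
  // Without replacement: keep each element independently with probability `fraction`.
  def bernoulli[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }

  // Draw from a Poisson distribution with the given mean (Knuth's multiplication method).
  private def poisson(mean: Double, rng: Random): Int = {
    val limit = math.exp(-mean)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // With replacement: emit each element a Poisson(fraction)-distributed number of times,
  // so `fraction` is the expected number of copies of each element.
  def withReplacement[T](data: Seq[T], fraction: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    data.flatMap(x => Seq.fill(poisson(fraction, rng))(x))
  }
}
```

In this model a sample without replacement can never contain an element twice, while a sample with replacement can, which matches the repeated values (2, 6, 11) in the `sample(true, 0.4, 2)` output above.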

5 filter

// filter keeps only the elements that satisfy the predicate
scala> sc.parallelize(1 to 30).filter(_ > 20).collect
res13: Array[Int] = Array(21, 22, 23, 24, 25, 26, 27, 28, 29, 30)

6 distinct: source code analysis

scala> rdd.distinct(4).mapPartitionsWithIndex((index,partition)=>{
     |       partition.map(x => s"partition $index, element $x")
     |     }).collect
res14: Array[String] = Array(partition 0, element 4, partition 1, element 1, partition 1, element 5, partition 2, element 2, partition 3, element 3)

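distinct source code analysis
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
 map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}
The method has two curried parameter lists: the first takes the number of partitions, the second an implicit Ordering[T] that defaults to null.
Now let's walk through the method body with an example (b is an RDD[Int] from the author's session whose definition is not shown):
scala> b.map(x => (x,null)).reduceByKey((x,y) => x).map(_._1).collect
res17: Array[Int] = Array(4, 6, 8, 3, 7, 5)
First, every element x is turned into a pair of the form (x, null):
(3,null)
(4,null)
(5,null)
...
(8,null)
(8,null)

reduceByKey then merges all pairs that share a key, two at a time, keeping one of them, so exactly one pair per distinct key remains. This is the step that removes the duplicates.

Analysis:
reduceByKey uses hash partitioning here, and the hash code of an Int is the Int itself, so with 2 partitions the target partition is simply element % 2.
4, 6, 8 % 2 = 0, so they go to partition 0.
3, 5, 7 % 2 = 1, so they go to partition 1.

With that in mind, the first result is easy to understand: with 4 partitions, 4 % 4 = 0, 1 % 4 = 1, 5 % 4 = 1, 2 % 4 = 2 and 3 % 4 = 3, which is exactly the placement printed by distinct(4) above.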
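
The same two ideas, deduplication through (x, null) pairs and placement by hash code modulo the partition count, can be sketched on a local collection. This is a plain-Scala model under stated assumptions, not Spark's implementation: `DistinctSketch`, `partitionOf` and `distinctLocal` are illustrative names, groupBy stands in for the shuffle, and null keys are not handled.

```scala
object DistinctSketch {
  // Non-negative modulo, since % can return a negative value for negative hash codes.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Partition a (non-null) key lands in under hash partitioning.
  def partitionOf(key: Any, numPartitions: Int): Int =
    nonNegativeMod(key.hashCode, numPartitions)

  // distinct expressed as map -> reduce per key -> map, on a local collection.
  def distinctLocal[T](data: Seq[T]): Seq[T] =
    data.map(x => (x, null))            // x becomes (x, null)
      .groupBy(_._1)                    // bring pairs with equal keys together
      .values
      .map(_.reduce((x, y) => x))       // keep a single pair per key
      .map(_._1)                        // drop the null again
      .toSeq
}
```

For example, `DistinctSketch.partitionOf(5, 4)` evaluates to 1, in line with the distinct(4) output where element 5 sits in partition 1. The order of the elements returned by `distinctLocal` is not specified.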

7 groupByKey and reduceByKey


scala> sc.parallelize(List(("a",1),("b",2),("c",3),("a",99))).groupByKey().collect
res21: Array[(String, Iterable[Int])] = Array((b,CompactBuffer(2)), (a,CompactBuffer(1, 99)), (c,CompactBuffer(3)))

scala> sc.parallelize(List(("a",1),("b",2),("c",3),("a",99))).reduceByKey(_+_).collect
res22: Array[(String, Int)] = Array((b,2), (a,100), (c,3))

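groupByKey:
① This operation may be very expensive. If you are grouping in order to perform an aggregation (such as a sum or average) over each key, `PairRDDFunctions.aggregateByKey` or `PairRDDFunctions.reduceByKey` will provide much better performance.
② As currently implemented, groupByKey must be able to hold all the key-value pairs for any single key in memory. If a key has too many values, the result can be an OutOfMemoryError.
reduceByKey:
Merges the values for each key using an associative and commutative reduce function. It also performs the merging locally on each mapper before sending the results to a reducer, similar to a combiner in MapReduce.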
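
The effect of the map-side combine can be illustrated by counting how many records would have to cross the shuffle. The sketch below is a plain-Scala model with hypothetical names (`CombineSketch`, `groupThenSum`, `combineThenSum`); partitions are local sequences and "shuffling" is just flattening them.

```scala
object CombineSketch {
  type Pair = (String, Int)

  // Merge the values per key within one collection of pairs.
  def combine(pairs: Seq[Pair]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

  // groupByKey-style: every record is shuffled, values are merged only afterwards.
  // Returns the result and the number of shuffled records.
  def groupThenSum(partitions: Seq[Seq[Pair]]): (Map[String, Int], Int) = {
    val shuffled = partitions.flatten
    (combine(shuffled), shuffled.size)
  }

  // reduceByKey-style: merge inside each partition first, then shuffle only the partial sums.
  def combineThenSum(partitions: Seq[Seq[Pair]]): (Map[String, Int], Int) = {
    val shuffled = partitions.flatMap(p => combine(p).toSeq)
    (combine(shuffled), shuffled.size)
  }
}
```

Both functions return the same sums; the second element of the result shows that pre-combining never shuffles more records than there are distinct keys per partition.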

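8 groupBy and sortBy

// groupBy: custom grouping; the grouping key is computed by the function passed in (not recommended, for the same reason as groupByKey above)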
scala> sc.parallelize(List("a","a","a","b","b","c")).groupBy(x=>x).collect
res26: Array[(String, Iterable[String])] = Array((b,CompactBuffer(b, b)), (a,CompactBuffer(a, a, a)), (c,CompactBuffer(c)))

// sortBy sorts by the key computed by the given function; internally it calls sortByKey
scala> sc.parallelize(List(("ruoze",30),("J哥",18),("星星",60))).sortBy(_._2).collect
res27: Array[(String, Int)] = Array((J哥,18), (ruoze,30), (星星,60))

9 join and cogroup

val a = sc.parallelize(List(("若泽","北京"),("J哥","上海"),("仓老师","杭州")))
val c = sc.parallelize(List(("若泽","30"),("J哥","18"),("星星","60")))

// full outer join
scala> a.fullOuterJoin(c).collect
res31: Array[(String, (Option[String], Option[String]))] = Array((星星,(None,Some(60))), (若泽,(Some(北京),Some(30))), (仓老师,(Some(杭州),None)), (J哥,(Some(上海),Some(18))))

// left outer join: every key of a is kept
scala> a.leftOuterJoin(c).collect
Array[(String, (String, Option[String]))] = Array((若泽,(北京,Some(30))), (仓老师,(杭州,None)), (J哥,(上海,Some(18))))

// right outer join: every key of c is kept
scala> a.rightOuterJoin(c).collect
res32: Array[(String, (Option[String], String))] = Array((星星,(None,60)), (若泽,(Some(北京),30)), (J哥,(Some(上海),18)))

//cogroup
scala> a.cogroup(c).collect
res33: Array[(String, (Iterable[String], Iterable[String]))] = Array((星星,(CompactBuffer(),CompactBuffer(60))), (若泽,(CompactBuffer(北京),CompactBuffer(30))), (仓老师,(CompactBuffer(杭州),CompactBuffer())), (J哥,(CompactBuffer(上海),CompactBuffer(18))))


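Source code:
// Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD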
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues( pair =>
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
    )
  }

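// Perform a left outer join of `this` and `other`. For each element (k, v) in `this`, the resulting RDD will either contain all pairs (k, (v, Some(w))) for w in `other`, or the pair (k, (v, None)) if no elements in `other` have key k. Uses the given Partitioner to partition the output RDD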
def leftOuterJoin[W](
      other: RDD[(K, W)],
      partitioner: Partitioner): RDD[(K, (V, Option[W]))] = self.withScope {
    this.cogroup(other, partitioner).flatMapValues { pair =>
      if (pair._2.isEmpty) {
        pair._1.iterator.map(v => (v, None))
      } else {
        for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w))
      }
    }
}

// For each key k in `this`, `other1` or `other2`, return a resulting RDD that contains a tuple with the list of values for that key in `this`, `other1` and `other2`
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner)
      : RDD[(K, (Iterable[V], Iterable[W1], Iterable[W2]))] = self.withScope {
    if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
    val cg = new CoGroupedRDD[K](Seq(self, other1, other2), partitioner)
    cg.mapValues { case Array(vs, w1s, w2s) =>
      (vs.asInstanceOf[Iterable[V]],
        w1s.asInstanceOf[Iterable[W1]],
        w2s.asInstanceOf[Iterable[W2]])
    }
  }
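Summary:
     /**
      * join is implemented on top of cogroup, for RDD[(K, V)].
      *
      * join matches records by key and returns only the keys present on both sides.
      * cogroup and fullOuterJoin also keep keys found on one side only; the missing side is an empty Iterable or None.
      * join return type            RDD[(K, (V, W))]
      * fullOuterJoin return type   RDD[(K, (Option[V], Option[W]))]
      * cogroup return type         RDD[(K, (Iterable[V], Iterable[W]))]
      */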
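
How join and leftOuterJoin fall out of cogroup can be replayed on local collections. The following plain-Scala sketch mirrors the for-comprehensions from the source above; it is an illustration under stated assumptions (the object `JoinSketch` and its local `Seq`-based signatures are made up here), not the Spark API.

```scala
object JoinSketch {
  // cogroup: for every key on either side, collect the values from both sides.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      (k, (left.filter(_._1 == k).map(_._2), right.filter(_._1 == k).map(_._2)))
    }.toMap
  }

  // join: keys with values on both sides only, one output pair per (v, w) combination.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    cogroup(left, right).toSeq.flatMap { case (k, (vs, ws)) =>
      for (v <- vs; w <- ws) yield (k, (v, w))
    }

  // leftOuterJoin: every left value survives; a missing right side becomes None.
  def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    cogroup(left, right).toSeq.flatMap { case (k, (vs, ws)) =>
      if (ws.isEmpty) vs.map(v => (k, (v, None: Option[W])))
      else for (v <- vs; w <- ws) yield (k, (v, Some(w): Option[W]))
    }
}
```

With the data of `a` and `c` from this section, `join` drops 仓老师 and 星星 because their keys appear on one side only, while `leftOuterJoin` keeps 仓老师 paired with None.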

10 zipWithIndex

scala> rdd.zipWithIndex.collect
res43: Array[(Int, Long)] = Array((1,0), (2,1), (3,2), (4,3), (5,4))

11 first() and take()

scala> val rdd = sc.parallelize(List(1,2,3,4,5))
scala> rdd.first()
res5: Int = 1
// Returns the first element of this RDD; internally it calls take
def first(): T = withScope {
  take(1) match {
    case Array(t) => t
    case _ => throw new UnsupportedOperationException("empty collection")
  }
}


// take: this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory
scala> rdd.take(2)
res9: Array[Int] = Array(1, 2)

12 top() and takeOrdered()

scala> val rdd = sc.parallelize(List(1,2,3,4,5))
scala> rdd.top(2) 
res10: Array[Int] = Array(5, 4)

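// Returns the top k (largest) elements of this RDD as defined by the specified implicit Ordering[T], and maintains the ordering. This does the opposite of takeOrdered; internally it calls takeOrdered
// Note: this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory
def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
    takeOrdered(num)(ord.reverse) // simply reverses the ordering
}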

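// Returns the first k (smallest) elements of this RDD as defined by the specified implicit Ordering[T], and maintains the ordering. This does the opposite of [[top]]
// Note: this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory
/**
 * sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2)
 * // returns Array(2, 3)
 */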
scala> rdd.takeOrdered(2)
res11: Array[Int] = Array(1, 2)
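
The relationship between the two methods, top being takeOrdered with the ordering reversed, can be reproduced on a local collection. This is only a sketch: `TopSketch` is a made-up name, and sorting the whole Seq is a simplification chosen for clarity rather than the way Spark computes the result across partitions.

```scala
object TopSketch {
  // The k smallest elements according to the implicit ordering, in ascending order.
  def takeOrdered[T](data: Seq[T], num: Int)(implicit ord: Ordering[T]): Seq[T] =
    data.sorted(ord).take(num)

  // The k largest elements: the same call with the ordering reversed.
  def top[T](data: Seq[T], num: Int)(implicit ord: Ordering[T]): Seq[T] =
    takeOrdered(data, num)(ord.reverse)
}
```

Passing an explicit ordering works the same way as with the RDD methods, for example `TopSketch.top(Seq("a", "bbb", "cc"), 1)(Ordering.by[String, Int](_.length))` selects the longest string.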

13 repartition() and coalesce()

// getNumPartitions returns the number of partitions
scala> rdd.getNumPartitions
res19: Int = 8

scala> val r4 = rdd.repartition(5)
scala> r4.getNumPartitions
res21: Int = 5

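coalesce:
scala> val r5 = r4.coalesce(6)
scala> r5.getNumPartitions
res24: Int = 5 // the partition count did not change: still 5, although one might expect 6

scala> val r6 = r5.coalesce(3)
scala> r6.getNumPartitions
res26: Int = 3 // 3 is less than the current count of 5, so this time the partition count does change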

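Source code analysis:
// Returns a new RDD that has exactly numPartitions partitions; internally it simply calls coalesce with shuffle = true
// Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data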
 def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
   coalesce(numPartitions, shuffle = true)
 }

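// Without a shuffle this results in a narrow dependency: if you go from 1000 partitions down to 100, there is no shuffle; instead each of the 100 new partitions claims 10 of the current partitions. If a larger number of partitions is requested, the RDD stays at its current number of partitions (the count does not change)
// To increase the number of partitions, pass shuffle = true as the second argument so that a shuffle is performed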
def coalesce(numPartitions: Int, shuffle: Boolean = false,
               partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
              (implicit ord: Ordering[T] = null)
      : RDD[T] = withScope {
    require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
    if (shuffle) {  // taken when shuffle = true is passed in
      /** Distributes elements evenly across output partitions, starting from a random partition. */
      val distributePartition = (index: Int, items: Iterator[T]) => {
        var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
        items.map { t =>
          // Note that the hash code of the key will just be the key itself. The HashPartitioner
          // will mod it with the number of total partitions.
          position = position + 1
          (position, t)
        }
      } : Iterator[(Int, T)]

      // include a shuffle step so that our upstream tasks are still distributed
      new CoalescedRDD(
        new ShuffledRDD[Int, T, T](
          mapPartitionsWithIndexInternal(distributePartition, isOrderSensitive = true),
          new HashPartitioner(numPartitions)),
        numPartitions,
        partitionCoalescer).values
    } else {
      new CoalescedRDD(this, numPartitions, partitionCoalescer)
    }
}


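Once more:
// shuffle defaults to false; passing shuffle = true makes coalesce behave like repartition
scala> val r7 = rdd.coalesce(5,true)
scala> r7.getNumPartitions
res2: Int = 5   // the requested count is applied; with shuffle = true this also works when it is larger than the current count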
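
The behaviour observed above, together with the round-robin key assignment in distributePartition, can be summarised in a small plain-Scala model. It is a sketch under stated assumptions: `CoalesceSketch`, `resultingPartitions` and `targetPartitions` are names invented here, and the first function is only a rule of thumb for the default coalescer, not a call into Spark.

```scala
import java.util.Random
import scala.util.hashing

object CoalesceSketch {
  // Rule of thumb: without a shuffle, coalesce can only lower the partition count.
  def resultingPartitions(current: Int, requested: Int, shuffle: Boolean): Int =
    if (shuffle) requested else math.min(current, requested)

  // Mirrors distributePartition: start at a position seeded by the partition index,
  // give consecutive elements consecutive keys, and let the hash partitioner
  // take each key modulo the number of output partitions.
  def targetPartitions(index: Int, itemCount: Int, numPartitions: Int): Seq[Int] = {
    val start = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
    (1 to itemCount).map(i => (start + i) % numPartitions)
  }
}
```

Because the keys of one input partition are consecutive integers, its elements are spread over the output partitions so that no output partition receives more than one element more than any other.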
