Operators that trigger a shuffle: repartitioning operations such as repartition and coalesce; 'ByKey' operations (except for counting ones) such as groupByKey and reduceByKey; join operations such as cogroup and join.
repartition source:
/**
* Return a new RDD that has exactly numPartitions partitions.
*
* Can increase or decrease the level of parallelism in this RDD. Internally, this uses
* a shuffle to redistribute data.
*
* If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
* which can avoid performing a shuffle.
*/
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
repartition can increase or decrease the parallelism of an RDD; it always shuffles to redistribute the data.
If you only want to reduce the number of partitions, consider coalesce instead, which can avoid the shuffle.
coalesce source:
To increase the number of partitions with coalesce, the second parameter must be set to true, which enables the shuffle (without it the partition count cannot grow):
coalesce(5, true)
def coalesce(numPartitions: Int, shuffle: Boolean = false,
             partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
            (implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
      numPartitions,
      partitionCoalescer).values
  } else {
    new CoalescedRDD(this, numPartitions, partitionCoalescer)
  }
}
In practice, coalesce is mostly used after a filter has discarded most of the data, to shrink the partition count accordingly.
repartition is used to break up large inputs, raise parallelism, and mitigate data skew, as sketched below.
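A minimal sketch of both patterns; the input path, app name, and partition counts are illustrative assumptions, not taken from the text above:

import org.apache.spark.{SparkConf, SparkContext}

object CoalesceRepartitionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("coalesce-vs-repartition"))

    // Hypothetical input: assume roughly 200 input splits of log lines.
    val logs = sc.textFile("hdfs:///tmp/access.log", minPartitions = 200)

    // filter discards most records, leaving 200 nearly-empty partitions;
    // coalesce shrinks them to 10 without a shuffle.
    val errors = logs.filter(_.contains("ERROR")).coalesce(10)

    // repartition always shuffles: spread a large or skewed input
    // across more partitions to raise parallelism.
    val spread = logs.repartition(400)

    println(s"errors: ${errors.getNumPartitions}, spread: ${spread.getNumPartitions}")
    sc.stop()
  }
}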
reduceByKey:
Repartitions the data by key; before records are sent to the new partitions, it aggregates within each original partition (a map-side combine), and only then redistributes the results.
groupByKey:
Groups by key; the data is aggregated only after it has been redistributed to the new partitions.
Consequently, reduceByKey reads less data during shuffle read than groupByKey.
In fact, both reduceByKey and groupByKey call combineByKeyWithClassTag:
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)]
The only difference is that groupByKey passes mapSideCombine = false, turning off map-side aggregation.
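A minimal word-count sketch contrasting the two (the sample data is an illustrative assumption; sc is an existing SparkContext):

val pairs = sc.parallelize(Seq("a", "b", "a", "a", "b")).map(w => (w, 1))

// reduceByKey combines within each map-side partition first, e.g. (a,1),(a,1) -> (a,2),
// so fewer records are written and read during the shuffle.
val byReduce = pairs.reduceByKey(_ + _)

// groupByKey ships every single (word, 1) record through the shuffle,
// and only sums after the data has landed in the new partitions.
val byGroup = pairs.groupByKey().mapValues(_.sum)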
mapPartitionsWithIndex(): works partition by partition and additionally passes each partition's index to the function.
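For example, tagging every element with the partition it belongs to (assuming sc is an existing SparkContext; the data is illustrative):

val rdd = sc.parallelize(1 to 10, numSlices = 3)
val tagged = rdd.mapPartitionsWithIndex { (index, iter) =>
  iter.map(x => s"partition $index -> $x")
}
tagged.collect().foreach(println)   // e.g. "partition 0 -> 1", "partition 0 -> 2", ...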
map vs. mapPartitions:
map applies the function to every element of the RDD and yields a new RDD; each record can be released from memory as soon as it has been processed.
mapPartitions applies the function to each whole partition. Because the function operates on an entire partition, loading a very large partition into memory can cause an OOM, a problem map generally avoids.
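A common use of mapPartitions is paying a per-partition setup cost once instead of once per element; the data and date format below are illustrative choices, with sc an existing SparkContext:

val lines = sc.parallelize(Seq("2021-01-01", "2021-06-15"), numSlices = 2)
val millis = lines.mapPartitions { iter =>
  // Setup runs once per partition; with map it would run once per element.
  val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd")
  iter.map(s => fmt.parse(s).getTime)
}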
foreach vs. foreachPartition:
When writing to an external system, foreachPartition is generally preferred, because it lets you open one connection per partition instead of one per record.
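A sketch of the pattern; Database.getConnection and insert are hypothetical stand-ins for whatever client your external store provides, not a real API:

rdd.foreachPartition { iter =>
  val conn = Database.getConnection()   // hypothetical: one connection per partition
  try {
    iter.foreach(record => conn.insert(record))
  } finally {
    conn.close()                        // always release the connection
  }
}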
textFile()
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
Under the hood it calls hadoopFile[]; the type parameters correspond to the input types of Hadoop's map function: K is the byte offset of each line (LongWritable) and V is the line content (Text). textFile then uses map() to strip the offset and return an RDD of the values as strings.
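Roughly what sc.textFile expands to, per the source above (the path and partition count are illustrative assumptions):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val viaHadoopFile = sc
  .hadoopFile("hdfs:///tmp/input.txt", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], 4)
  .map(pair => pair._2.toString)        // keep the line, drop the byte offset

val viaTextFile = sc.textFile("hdfs:///tmp/input.txt", minPartitions = 4)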
collect()
Use with caution. It returns an array that contains all of the elements in this RDD, and all of that data is loaded into the driver's memory, so careless use can cause an OOM. Only use collect on result sets you know in advance are small.
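Safer alternatives when the result may be large (assuming bigRdd is a large RDD; the limit and output path are illustrative):

val peek = bigRdd.take(100)               // bounded: at most 100 elements on the driver
bigRdd.saveAsTextFile("hdfs:///tmp/out")  // written in parallel by executors, not the driver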