Use mapPartitions or mapPartitionsWithIndex instead of map
When the mapping function needs to repeatedly create expensive auxiliary objects (database connections, network/TCP I/O connections, file streams, etc.):
- mapPartitions creates the auxiliary object once per partition
- map creates the auxiliary object once per element
- mapPartitionsWithIndex works like mapPartitions, except that its function takes two parameters: the index of the partition being processed, and the Iterator over that partition's elements
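As a sketch of the difference (the `DbConnection` helper here is hypothetical, standing in for any expensive resource), map pays the setup cost per element while mapPartitions pays it per partition:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical connection helper -- stands in for any expensive resource.
class DbConnection {
  def lookup(id: Int): String = s"row-$id"
  def close(): Unit = ()
}

object MapPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sketch").setMaster("local[2]"))
    val ids = sc.parallelize(1 to 1000, numSlices = 4)

    // map: one connection per element -- 1000 connections in total.
    val perElement = ids.map { id =>
      val conn = new DbConnection()
      try conn.lookup(id) finally conn.close()
    }

    // mapPartitions: one connection per partition -- 4 connections in total.
    val perPartition = ids.mapPartitions { iter =>
      val conn = new DbConnection()
      // Materialize before closing: the iterator is consumed lazily,
      // so closing first would use a dead connection.
      val rows = iter.map(conn.lookup).toList
      conn.close()
      rows.iterator
    }

    // mapPartitionsWithIndex: same shape, plus the partition index.
    val tagged = ids.mapPartitionsWithIndex { (idx, iter) =>
      iter.map(id => s"partition-$idx:$id")
    }

    println(perPartition.count())
    sc.stop()
  }
}
```

Note the ordering inside mapPartitions: because the iterator is lazy, the partition must be fully consumed before the connection is closed.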
Use foreachPartition instead of foreach
When the iteration needs to repeatedly create expensive auxiliary objects (database connections, network/TCP I/O connections, file streams, etc.). The Spark RDD source for both methods:
/**
* Applies a function f to all elements of this RDD.
*/
def foreach(f: T => Unit): Unit = withScope {
val cleanF = sc.clean(f)
sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
/**
* Applies a function f to each partition of this RDD.
*/
def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
val cleanF = sc.clean(f)
sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}
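A typical use is writing records out with one connection per partition instead of one per record. This is a sketch: `DbConnection` and its `write` method are hypothetical stand-ins for a real client (JDBC, Kafka producer, etc.), and `sc` is assumed to be an existing SparkContext:

```scala
// Hypothetical sink -- stands in for any real client.
class DbConnection {
  def write(record: String): Unit = println(record)
  def close(): Unit = ()
}

// foreach would open one connection per record; foreachPartition
// opens one per partition and streams the whole partition through it.
sc.parallelize(Seq("a", "b", "c"), numSlices = 2).foreachPartition { iter =>
  val conn = new DbConnection()
  try iter.foreach(conn.write)
  finally conn.close()
}
```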
Use coalesce instead of repartition
When you need to change the number of partitions of an RDD, you can use either coalesce or repartition; in Spark's source, repartition(numPartitions) is simply coalesce(numPartitions, shuffle = true).
The coalesce source is as follows:
/**
* Return a new RDD that is reduced into `numPartitions` partitions.
*
* This results in a narrow dependency, e.g. if you go from 1000 partitions
* to 100 partitions, there will not be a shuffle, instead each of the 100
* new partitions will claim 10 of the current partitions. If a larger number
* of partitions is requested, it will stay at the current number of partitions.
*
* However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
* this may result in your computation taking place on fewer nodes than
* you like (e.g. one node in the case of numPartitions = 1). To avoid this,
* you can pass shuffle = true. This will add a shuffle step, but means the
* current upstream partitions will be executed in parallel (per whatever
* the current partitioning is).
*
* @note With shuff
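Under the semantics described in that doc comment, usage might look like the following sketch (assuming an existing SparkContext `sc`):

```scala
val rdd = sc.parallelize(1 to 10000, numSlices = 1000)

// Narrow dependency, no shuffle: each of the 100 new partitions
// claims 10 of the original 1000.
val fewer = rdd.coalesce(100)

// Increasing the partition count requires a shuffle;
// repartition(2000) is equivalent to coalesce(2000, shuffle = true).
val more = rdd.repartition(2000)

// A drastic coalesce without a shuffle would run the upstream
// computation on very few nodes; shuffle = true adds a shuffle step
// but keeps the upstream partitions executing in parallel.
val one = rdd.coalesce(1, shuffle = true)
```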