Spark Operator Tuning

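This article covers several key Spark operator optimizations: replacing map with mapPartitions or mapPartitionsWithIndex when the mapping would otherwise create extra objects frequently, replacing foreach with foreachPartition to reduce object creation, using coalesce instead of repartition to reduce shuffles, and using treeReduce and treeAggregate to lighten the computation on the Driver. Together, these optimizations can improve the performance and efficiency of a Spark application.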

Use mapPartitions or mapPartitionsWithIndex instead of map

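Use this when the mapping step would otherwise create an expensive extra object over and over again, such as a database connection, a network (TCP) or other IO connection, or a file stream.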

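  • mapPartitions creates the extra object once per partition.
  • map creates the extra object once per element.
  • mapPartitionsWithIndex is essentially the same as mapPartitions, except that the function takes two arguments: the first is the index of the partition being processed, and the second is an Iterator over the elements of that partition.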
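
As a rough sketch of the difference described above, the following local-mode example counts how many times a resource is created under each operator. `FakeConnection` is a hypothetical stand-in for a real database or network connection, and the app name, accumulator names, and `local[2]` master are assumptions made only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for an expensive resource (database or TCP connection).
class FakeConnection {
  def lookup(id: Int): String = s"user-$id"
  def close(): Unit = ()
}

object MapPartitionsSketch {
  // Returns (connections created by map, connections created by mapPartitions,
  //          results of mapPartitions, results of mapPartitionsWithIndex).
  def run(): (Long, Long, Seq[String], Seq[String]) = {
    val conf = new SparkConf().setAppName("mapPartitions-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val ids = sc.parallelize(1 to 8, numSlices = 4)
      // Accumulators updated inside transformations can be applied more than once
      // if a task is retried; they are used here only to illustrate the counts.
      val createdByMap = sc.longAccumulator("createdByMap")
      val createdByMapPartitions = sc.longAccumulator("createdByMapPartitions")

      // map: the function runs once per element, so a connection is built per element.
      ids.map { id =>
        val conn = new FakeConnection
        createdByMap.add(1)
        try conn.lookup(id) finally conn.close()
      }.collect()

      // mapPartitions: the function runs once per partition, so one connection
      // serves every element of that partition.
      val looked = ids.mapPartitions { iter =>
        val conn = new FakeConnection
        createdByMapPartitions.add(1)
        // The iterator is lazy: materialize the results before closing the connection.
        val out = iter.map(id => conn.lookup(id)).toList
        conn.close()
        out.iterator
      }.collect()

      // mapPartitionsWithIndex: same shape, plus the partition index as first argument.
      val tagged = ids.mapPartitionsWithIndex { (index, iter) =>
        iter.map(id => s"p$index:$id")
      }.collect()

      (createdByMap.sum, createdByMapPartitions.sum, looked.toSeq, tagged.toSeq)
    } finally {
      sc.stop()
    }
  }
}
```

One design point worth noting: because the partition Iterator is lazy, closing the resource before the results are consumed would break the lookup, which is why the sketch materializes the partition first. For very large partitions that trade-off may not be acceptable, and a wrapper iterator that closes the resource on exhaustion is a common alternative.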

Use foreachPartition instead of foreach

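Use this when the iteration would otherwise create an expensive extra object over and over again (a database connection, a network (TCP) or other IO connection, a file stream, and so on). In the Spark source shown below, foreach invokes f once per element, whereas foreachPartition hands f the whole Iterator of a partition, so any setup done inside f runs only once per partition.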

/**
   * Applies a function f to all elements of this RDD.
   */
  def foreach(f: T => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
  }

/**
   * Applies a function f to each partition of this RDD.
   */
  def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
    val cleanF = sc.clean(f)
    sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
  }
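
A minimal sketch of how foreachPartition is typically used to share one resource across a partition follows. `FakeSink` is a hypothetical placeholder for a JDBC connection, socket, or file stream, and the app name, accumulator names, and `local[2]` master are assumptions for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sink standing in for a JDBC connection, socket, or file stream.
class FakeSink {
  def write(record: Int): Unit = ()
  def close(): Unit = ()
}

object ForeachPartitionSketch {
  // Returns (sinks opened, records written).
  def run(): (Long, Long) = {
    val conf = new SparkConf().setAppName("foreachPartition-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val records = sc.parallelize(1 to 9, numSlices = 3)
      val opened = sc.longAccumulator("opened")
      val written = sc.longAccumulator("written")

      records.foreachPartition { iter =>
        // Created on the executor, once per partition, so it never needs to be serialized.
        val sink = new FakeSink
        opened.add(1)
        try {
          iter.foreach { record =>
            sink.write(record)
            written.add(1)
          }
        } finally {
          sink.close()
        }
      }

      (opened.sum, written.sum)
    } finally {
      sc.stop()
    }
  }
}
```

With foreach, the same sink would have to be created inside the per-element function (or be serializable and shipped with the closure), which is the cost this pattern is meant to avoid.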

Use coalesce instead of repartition

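When you need to change the number of partitions of an RDD, you can use either coalesce or repartition. repartition always performs a shuffle, whereas coalesce by default reduces the partition count without one.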
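
A small local-mode sketch of how the two calls behave is given here. The RDD contents, partition counts, app name, and `local[2]` master are assumptions chosen for illustration; checking the lineage string for `ShuffledRDD` is just one informal way to see whether a shuffle was introduced.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceSketch {
  // Returns the partition counts of four variants, followed by whether the
  // lineage of the coalesce and repartition variants contains a shuffle.
  def run(): (Int, Int, Int, Int, Boolean, Boolean) = {
    val conf = new SparkConf().setAppName("coalesce-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 1000, numSlices = 100)

      // Narrow dependency: each new partition claims several existing ones.
      val narrowed = rdd.coalesce(10)

      // repartition(n) is coalesce(n, shuffle = true), so it always shuffles.
      val reshuffled = rdd.repartition(10)

      // Without a shuffle, coalesce cannot increase the number of partitions.
      val notGrown = rdd.coalesce(200)

      // With shuffle = true, the partition count can grow.
      val grown = rdd.coalesce(200, shuffle = true)

      (
        narrowed.getNumPartitions,
        reshuffled.getNumPartitions,
        notGrown.getNumPartitions,
        grown.getNumPartitions,
        narrowed.toDebugString.contains("ShuffledRDD"),
        reshuffled.toDebugString.contains("ShuffledRDD")
      )
    } finally {
      sc.stop()
    }
  }
}
```

When the goal is only to shrink the partition count moderately, plain coalesce is usually the cheaper choice. A drastic reduction (for example to a single partition) may still be better served by a shuffle, since otherwise the upstream computation runs on correspondingly few tasks.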

The source of coalesce is as follows:

 /**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions. If a larger number
   * of partitions is requested, it will stay at the current number of partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * @note With shuff