Use mapPartitions or mapPartitionsWithIndex instead of map
When the mapping function needs to repeatedly create expensive auxiliary objects (database connections, network/TCP I/O connections, file streams, etc.):
- mapPartitions creates the auxiliary object once per partition
- map creates the auxiliary object once per element
- mapPartitionsWithIndex works like mapPartitions, except that its function takes two parameters: the index of the partition being processed, and the Iterator over that partition's elements
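As a sketch of the difference (the `DbConnection` helper here is hypothetical, standing in for any expensive resource), map pays the setup cost per element while mapPartitions pays it per partition:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical connection helper -- stands in for any expensive resource.
class DbConnection {
  def lookup(id: Int): String = s"row-$id"
  def close(): Unit = ()
}

object MapPartitionsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sketch").setMaster("local[2]"))
    val ids = sc.parallelize(1 to 1000, numSlices = 4)

    // map: one connection per element -- 1000 connections in total.
    val perElement = ids.map { id =>
      val conn = new DbConnection()
      try conn.lookup(id) finally conn.close()
    }

    // mapPartitions: one connection per partition -- 4 connections in total.
    val perPartition = ids.mapPartitions { iter =>
      val conn = new DbConnection()
      // Materialize before closing: the iterator is consumed lazily,
      // so closing first would use a dead connection.
      val rows = iter.map(conn.lookup).toList
      conn.close()
      rows.iterator
    }

    // mapPartitionsWithIndex: same shape, plus the partition index.
    val tagged = ids.mapPartitionsWithIndex { (idx, iter) =>
      iter.map(id => s"partition-$idx:$id")
    }

    println(perPartition.count())
    sc.stop()
  }
}
```

Note the ordering inside mapPartitions: because the iterator is lazy, the partition must be fully consumed before the connection is closed.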
Use foreachPartition instead of foreach
When the iteration needs to repeatedly create expensive auxiliary objects (database connections, network/TCP I/O connections, file streams, etc.). The Spark RDD source for both methods:
/**
* Applies a function f to all elements of this RDD.
*/
def foreach(f: T => Unit): Unit = withScope {
val cleanF = sc.clean(f)
sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}
/**
* Applies a function f to each partition of this RDD.
*/
def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {
val cleanF = sc.clean(f)
sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))
}
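A typical use is writing records out with one connection per partition instead of one per record. This is a sketch: `DbConnection` and its `write` method are hypothetical stand-ins for a real client (JDBC, Kafka producer, etc.), and `sc` is assumed to be an existing SparkContext:

```scala
// Hypothetical sink -- stands in for any real client.
class DbConnection {
  def write(record: String): Unit = println(record)
  def close(): Unit = ()
}

// foreach would open one connection per record; foreachPartition
// opens one per partition and streams the whole partition through it.
sc.parallelize(Seq("a", "b", "c"), numSlices = 2).foreachPartition { iter =>
  val conn = new DbConnection()
  try iter.foreach(conn.write)
  finally conn.close()
}
```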
Use coalesce instead of repartition
When you need to change the number of partitions of an RDD, you can use either coalesce or repartition; in Spark's source, repartition(numPartitions) is simply coalesce(numPartitions, shuffle = true).
The coalesce source is as follows:
/**
* Return a new RDD that is reduced into `numPartitions` partitions.
*
* This results in a narrow dependency, e.g. if you go from 1000 partitions
* to 100 partitions, there will not be a shuffle, instead each of the 100
* new partitions will claim 10 of the current partitions. If a larger number
* of partitions is requested, it will stay at the current number of partitions.
*
* However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
* this may result in your computation taking place on fewer nodes than
* you like (e.g. one node in the case of numPartitions = 1). To avoid this,
* you can pass shuffle = true. This will add a shuffle step, but means the
* current upstream partitions will be executed in parallel (per whatever
* the current partitioning is).
*
* @note With shuff
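Under the semantics described in that doc comment, usage might look like the following sketch (assuming an existing SparkContext `sc`):

```scala
val rdd = sc.parallelize(1 to 10000, numSlices = 1000)

// Narrow dependency, no shuffle: each of the 100 new partitions
// claims 10 of the original 1000.
val fewer = rdd.coalesce(100)

// Increasing the partition count requires a shuffle;
// repartition(2000) is equivalent to coalesce(2000, shuffle = true).
val more = rdd.repartition(2000)

// A drastic coalesce without a shuffle would run the upstream
// computation on very few nodes; shuffle = true adds a shuffle step
// but keeps the upstream partitions executing in parallel.
val one = rdd.coalesce(1, shuffle = true)
```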