Operators that trigger a shuffle: repartitioning operations such as repartition and coalesce; 'ByKey' operations (except for counting ones) such as groupByKey and reduceByKey; join operations such as cogroup and join.
repartition source:
/**
* Return a new RDD that has exactly numPartitions partitions.
*
* Can increase or decrease the level of parallelism in this RDD. Internally, this uses
* a shuffle to redistribute data.
*
* If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
* which can avoid performing a shuffle.
*/
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
repartition can increase or decrease the parallelism of an RDD; it always shuffles to redistribute the data.
If you only want to reduce the number of partitions, consider coalesce instead, which can avoid the shuffle.
coalesce source:
To increase the number of partitions with coalesce, the second parameter must be set to true, which enables the shuffle (without it the partition count cannot grow):
coalesce(5, true)
def coalesce(numPartitions: Int, shuffle: Boolean = false,
             partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
            (implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
      numPartitions,
      partitionCoalescer).values
  } else {
    new CoalescedRDD(this, numPartitions, partitionCoalescer)
  }
}
In practice, coalesce is mostly used after a filter has discarded most of the data, to shrink the partition count accordingly.
repartition is used to break up large inputs, raise parallelism, and mitigate data skew, as sketched below.
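A minimal sketch of both patterns; the input path, app name, and partition counts are illustrative assumptions, not taken from the text above:

import org.apache.spark.{SparkConf, SparkContext}

object CoalesceRepartitionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("coalesce-vs-repartition"))

    // Hypothetical input: assume roughly 200 input splits of log lines.
    val logs = sc.textFile("hdfs:///tmp/access.log", minPartitions = 200)

    // filter discards most records, leaving 200 nearly-empty partitions;
    // coalesce shrinks them to 10 without a shuffle.
    val errors = logs.filter(_.contains("ERROR")).coalesce(10)

    // repartition always shuffles: spread a large or skewed input
    // across more partitions to raise parallelism.
    val spread = logs.repartition(400)

    println(s"errors: ${errors.getNumPartitions}, spread: ${spread.getNumPartitions}")
    sc.stop()
  }
}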
reduceByKey:
Repartitions the data by key; before records are sent to the new partitions, it aggregates within each original partition (a map-side combine), and only then redistributes the results.
groupByKey:
Groups by key; the data is aggregated only after it has been redistributed to the new partitions.
Consequently, reduceByKey reads less data during shuffle read than groupByKey.
In fact, both reduceByKey and groupByKey call combineByKeyWithClassTag:
def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)]
The only difference is that groupByKey passes mapSideCombine = false, turning off map-side aggregation.
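A minimal word-count sketch contrasting the two (the sample data is an illustrative assumption; sc is an existing SparkContext):

val pairs = sc.parallelize(Seq("a", "b", "a", "a", "b")).map(w => (w, 1))

// reduceByKey combines within each map-side partition first, e.g. (a,1),(a,1) -> (a,2),
// so fewer records are written and read during the shuffle.
val byReduce = pairs.reduceByKey(_ + _)

// groupByKey ships every single (word, 1) record through the shuffle,
// and only sums after the data has landed in the new partitions.
val byGroup = pairs.groupByKey().mapValues(_.sum)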
mapPartitionsWithIndex(): works partition by partition and additionally passes each partition's index to the function.
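For example, tagging every element with the partition it belongs to (assuming sc is an existing SparkContext; the data is illustrative):

val rdd = sc.parallelize(1 to 10, numSlices = 3)
val tagged = rdd.mapPartitionsWithIndex { (index, iter) =>
  iter.map(x => s"partition $index -> $x")
}
tagged.collect().foreach(println)   // e.g. "partition 0 -> 1", "partition 0 -> 2", ...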
map vs. mapPartitions:
map applies the function to every element of the RDD and yields a new RDD; each record can be released from memory as soon as it has been processed.
mapPartitions applies the function to each whole partition. Because the function operates on an entire partition, loading a very large partition into memory can cause an OOM, a problem map generally avoids.
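A common use of mapPartitions is paying a per-partition setup cost once instead of once per element; the data and date format below are illustrative choices, with sc an existing SparkContext:

val lines = sc.parallelize(Seq("2021-01-01", "2021-06-15"), numSlices = 2)
val millis = lines.mapPartitions { iter =>
  // Setup runs once per partition; with map it would run once per element.
  val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd")
  iter.map(s => fmt.parse(s).getTime)
}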
foreach vs. foreachPartition:
When writing to an external system, foreachPartition is generally preferred, because it lets you open one connection per partition instead of one per record.
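A sketch of the pattern; Database.getConnection and insert are hypothetical stand-ins for whatever client your external store provides, not a real API:

rdd.foreachPartition { iter =>
  val conn = Database.getConnection()   // hypothetical: one connection per partition
  try {
    iter.foreach(record => conn.insert(record))
  } finally {
    conn.close()                        // always release the connection
  }
}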
textFile()
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
Under the hood it calls hadoopFile[]; the type parameters correspond to the input types of Hadoop's map function: K is the byte offset of each line (LongWritable) and V is the line content (Text). textFile then uses map() to strip the offset and return an RDD of the values as strings.
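Roughly what sc.textFile expands to, per the source above (the path and partition count are illustrative assumptions):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val viaHadoopFile = sc
  .hadoopFile("hdfs:///tmp/input.txt", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text], 4)
  .map(pair => pair._2.toString)        // keep the line, drop the byte offset

val viaTextFile = sc.textFile("hdfs:///tmp/input.txt", minPartitions = 4)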
collect()
Use with caution. It returns an array that contains all of the elements in this RDD, and all of that data is loaded into the driver's memory, so careless use can cause an OOM. Only use collect on result sets you know in advance are small.
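Safer alternatives when the result may be large (assuming bigRdd is a large RDD; the limit and output path are illustrative):

val peek = bigRdd.take(100)               // bounded: at most 100 elements on the driver
bigRdd.saveAsTextFile("hdfs:///tmp/out")  // written in parallel by executors, not the driver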