Signature of the aggregate operator:
def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = withScope
For example, define a seqOp that takes the minimum and a combOp that sums:
def seqOP(a: Int, b: Int): Int = {
  println("seqOp: " + a + "\t" + b)
  math.min(a, b)
}

def combOp(a: Int, b: Int): Int = {
  println("combOp: " + a + "\t" + b)
  a + b
}
Now run the computation:
println(sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8), 2).aggregate(3)(seqOP, combOp))
println(sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), 2).aggregate(3)(seqOP, combOp))
Both calls print 7. Let's walk through the execution:
Execution of sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8), 2).aggregate(3)(seqOP, combOp): Spark first splits the sequence into a two-partition RDD, then computes the minimum within each partition. In the first partition (1, 2, 3, 4) the zeroValue = 3 also participates, so the minimum is 1; in the second partition (5, 6, 7, 8) the zeroValue 3 participates as well, so the minimum is 3. Finally, when combOp is called, 3 participates once more, so the result is 3 + 1 + 3 = 7. (Note: Spark splits the List in order into the specified number of equal-sized partitions; see the Spark source below for details.)
Execution of sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12), 2).aggregate(3)(seqOP, combOp): the first partition is 1, 2, 3, 4, 5, 6 and the second partition is 7, 8, 9, 10, 11, 12.
So seqOP yields a minimum of 1 in the first partition (with 3 participating) and a minimum of 3 in the second partition, and combOp then computes 3 + 1 + 3 = 7.
With zeroValue = 9, sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8), 2).aggregate(9)(seqOP, combOp) returns 15: the first partition's minimum is 1, the second partition's minimum is 5, and combOp computes 9 + 1 + 5 = 15.
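The walkthroughs above can be checked without a Spark cluster. Below is a minimal sketch in plain Scala (simulateAggregate is a hypothetical helper, not a Spark API) that reproduces aggregate's two phases: fold each partition with seqOp, then fold the per-partition results with combOp, with zeroValue participating in both phases.

```scala
// Simulate RDD.aggregate on pre-split partitions:
// phase 1 folds each partition with seqOp (zeroValue included),
// phase 2 folds the partial results with combOp (zeroValue included again).
def simulateAggregate[T, U](partitions: Seq[Seq[T]], zeroValue: U)
                           (seqOp: (U, T) => U, combOp: (U, U) => U): U = {
  val partials = partitions.map(_.foldLeft(zeroValue)(seqOp))
  partials.foldLeft(zeroValue)(combOp)
}

// How Spark splits List(1..8) into 2 slices (in order, equal sizes):
val parts = Seq(Seq(1, 2, 3, 4), Seq(5, 6, 7, 8))
println(simulateAggregate(parts, 3)(math.min, _ + _)) // 7  = 3 + min(3,1..4) + min(3,5..8)
println(simulateAggregate(parts, 9)(math.min, _ + _)) // 15 = 9 + 1 + 5
```

Note that this matches Spark's semantics only up to ordering: on a real cluster the combOp phase may receive partition results in any order, which is why combOp is required to be commutative and associative.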
Going further: how Spark splits a sequence into partitions.
sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8), 2)
The split is performed by org.apache.spark.rdd.ParallelCollectionRDD#slice, whose source is:
def slice[T: ClassTag](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  if (numSlices < 1) {
    throw new IllegalArgumentException("Positive number of slices required")
  }
  // Sequences need to be sliced at the same set of index positions for operations
  // like RDD.zip() to behave as expected
  def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
    (0 until numSlices).iterator.map { i =>
      val start = ((i * length) / numSlices).toInt
      val end = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }
  }
  seq match {
    case r: Range =>
      positions(r.length, numSlices).zipWithIndex.map { case ((start, end), index) =>
        // If the range is inclusive, use inclusive range for the last slice
        if (r.isInclusive && index == numSlices - 1) {
          new Range.Inclusive(r.start + start * r.step, r.end, r.step)
        } else {
          new Range(r.start + start * r.step, r.start + end * r.step, r.step)
        }
      }.toSeq.asInstanceOf[Seq[Seq[T]]]
    case nr: NumericRange[_] =>
      // For ranges of Long, Double, BigInteger, etc
      val slices = new ArrayBuffer[Seq[T]](numSlices)
      var r = nr
      for ((start, end) <- positions(nr.length, numSlices)) {
        val sliceSize = end - start
        slices += r.take(sliceSize).asInstanceOf[Seq[T]]
        r = r.drop(sliceSize)
      }
      slices
    case _ =>
      val array = seq.toArray // To prevent O(n^2) operations for List etc
      positions(array.length, numSlices).map { case (start, end) =>
        array.slice(start, end).toSeq
      }.toSeq
  }
}
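To see the Range branch above in action, here is a standalone sketch (sliceRange is a hypothetical helper, not a Spark API) that mirrors its logic. It uses Range.apply / Range.inclusive in place of the constructors in the Spark source, but the boundary arithmetic is the same; note how an inclusive input Range keeps its last slice inclusive so no element is dropped.

```scala
// Mirror of the `case r: Range` branch of ParallelCollectionRDD#slice:
// slice boundaries come from the same positions arithmetic, and the last
// slice stays inclusive when the input Range is inclusive.
def sliceRange(r: Range, numSlices: Int): Seq[Range] = {
  def positions(length: Long, n: Int): Seq[(Int, Int)] =
    (0 until n).map { i =>
      (((i * length) / n).toInt, (((i + 1) * length) / n).toInt)
    }
  positions(r.length, numSlices).zipWithIndex.map { case ((start, end), index) =>
    if (r.isInclusive && index == numSlices - 1)
      Range.inclusive(r.start + start * r.step, r.end, r.step)
    else
      Range(r.start + start * r.step, r.start + end * r.step, r.step)
  }
}

println(sliceRange(1 to 8, 2).map(_.toList))
// List(List(1, 2, 3, 4), List(5, 6, 7, 8))
```

This is also why parallelizing a Range is cheap: each slice is itself a Range (start, end, step), so no elements are materialized until a task iterates over its slice.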
The core of this method is the positions helper:
def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] = {
  (0 until numSlices).iterator.map { i =>
    val start = ((i * length) / numSlices).toInt
    val end = (((i + 1) * length) / numSlices).toInt
    (start, end)
  }
}
As you can see, the elements are split as evenly as possible: for example, splitting 6 elements into 2 partitions, positions outputs (0, 3) and (3, 6).
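Since positions is a plain function, its boundaries can be checked directly with a standalone copy (same arithmetic as the Spark source above):

```scala
// Standalone copy of the positions helper for experimentation.
def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] =
  (0 until numSlices).iterator.map { i =>
    val start = ((i * length) / numSlices).toInt
    val end = (((i + 1) * length) / numSlices).toInt
    (start, end)
  }

println(positions(6, 2).toList) // List((0,3), (3,6))
println(positions(8, 2).toList) // List((0,4), (4,8))  -- the aggregate examples above
println(positions(7, 3).toList) // List((0,2), (2,4), (4,7)) -- uneven case: later slices absorb the remainder
```

When the length is not divisible by numSlices, the integer division makes the slice sizes differ by at most one, so the partitions stay balanced.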