There are several nice techniques in Spark, e.g. on the user API side. Here we will dive in and check how Spark implements them.
1.abstract(functions in RDD)
group | function | feature | principle
--- | --- | --- | ---
1 | first() | retrieves the first element of this RDD; if there is more than one partition, the first partition is taken by priority. Internally it just calls take(1). | runs jobs over partition ranges, range by range, until the total amount reaches the expected number
1 | take(n) | extracts the first n elements of this RDD; equivalent to first() when n is 1 | same as above
2 | top(n)(order) | extracts the top N (max by default) elements; calls takeOrdered(num)(ord.reverse) internally. Search engines such as Solr use a similar technique. | concurrently spawns tasks on all partitions to do the same operation, i.e. each task tries to retrieve n elements from its own partition, then the sub-results are merged
2 | max()(order) | retrieves the max element; although the internal algorithm differs from top(n), the final effect is the same (performance may differ). Uses rdd.reduce(ord.max) internally. | same as above
3 | min()(order) | the opposite of max(); uses rdd.reduce(ord.min) internally | similar to top(n)
3 | takeOrdered(n)(order) | the opposite of top(n); similar to min() but takes the N minimum items | similar to top(n)
4 | collect() | retrieves all the results of this RDD's computation, so an OOM exception can occur on the driver | similar to top(n), but each task is not limited to n elements; every task returns its whole partition
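For a quick feel of these actions, here is a minimal spark-shell sketch (it assumes an existing SparkContext `sc` and a made-up toy dataset; the comments show the expected results):

val rdd = sc.parallelize(Seq(10, 4, 2, 12, 3), numSlices = 3)
rdd.first()        // 10 -> calls take(1) internally
rdd.take(2)        // Array(10, 4)
rdd.top(2)         // Array(12, 10) -> takeOrdered(2)(ord.reverse) internally
rdd.max()          // 12 -> rdd.reduce(ord.max)
rdd.min()          // 2  -> rdd.reduce(ord.min)
rdd.takeOrdered(2) // Array(2, 3)
rdd.collect()      // Array(10, 4, 2, 12, 3) -- the whole dataset is pulled to the driver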
2.techniques
a.lazy computation & computes range by range
For example, with take(n) Spark can act as a lazy worker: action only when needed! That is, Spark tries to use as few resources as possible. See below for details:
/**-estimates partitions step by step to decrease resource consumption, i.e. lazy computation.
 * Take the first num elements of the RDD. It works by first scanning one partition, and use the
 * results from that partition to estimate the number of additional partitions needed to satisfy
 * the limit.-the loop runs (continued jobs) to check whether the results of the partitions scanned so far satisfy the target num.
 * the result returned is sorted by partition sequence.
* @note due to complications in the internal implementation, this method will raise
* an exception if called on an RDD of `Nothing` or `Null`.
*/
def take(num: Int): Array[T] = withScope {
  if (num == 0) {
    new Array[T](0)
  } else {
    val buf = new ArrayBuffer[T]
    val totalParts = this.partitions.length
    var partsScanned = 0
    while (buf.size < num && partsScanned < totalParts) { //-loop until the buffered results satisfy the target num
      //-1 compute which partition range to run over in this iteration
      // The number of partitions to try in this iteration. It is ok for this number to be
      // greater than totalParts because we actually cap it at totalParts in runJob.
      var numPartsToTry = 1
      if (partsScanned > 0) {
        log.info(s"-step to next loop,numPartsToTry=${numPartsToTry}")
        // If we didn't find any rows after the previous iteration, quadruple and retry.
        // Otherwise, interpolate the number of partitions we need to try, but overestimate
        // it by 50%. We also cap the estimation in the end.
        if (buf.size == 0) { //-no data at all in the previously scanned partitions, so widen the range to more partitions
          numPartsToTry = partsScanned * 4
        } else {
          // the left side of max is >=1 whenever partsScanned >= 2
          //-estimate the remaining parts to compute: estimated total parts = 1.5 * num * partsScanned / buf.size
          numPartsToTry = Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
          numPartsToTry = Math.min(numPartsToTry, partsScanned * 4) //-cap the partition range at 4x what was already scanned
        }
      }
      val left = num - buf.size
      //-step (range) to the next run
      val p = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
      //-2 scan the specified partitions, taking at most `left` elements from each; similar to Solr's grouped query
      val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true)
      //-3 add up into the total buf; note: each partition contributes at most num - buf.size, and since buf is mutable
      //   the cap shrinks as results accumulate
      res.foreach(buf ++= _.take(num - buf.size)) //-buf grows per partition
      partsScanned += numPartsToTry
    }
    buf.toArray
  }
}
Again, Spark estimates how many partitions to compute next (numPartsToTry) from the amount of items already scanned.
Of course, this feature depends on Spark's ability to run a job over only a subset of partitions (the partition range passed to sc.runJob).
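To make the estimation concrete, here is a small standalone sketch (a hypothetical illustration only, not Spark code) that replays the numPartsToTry arithmetic above against a made-up list of per-partition row counts:

//-hypothetical replay of take(num)'s partition estimation; rowsPerPartition is a made-up input
object TakeEstimation {
  def partitionsTriedPerRound(rowsPerPartition: Array[Int], num: Int): Seq[Int] = {
    val rounds = scala.collection.mutable.ArrayBuffer[Int]()
    var bufSize = 0
    var partsScanned = 0
    while (bufSize < num && partsScanned < rowsPerPartition.length) {
      var numPartsToTry = 1
      if (partsScanned > 0) {
        numPartsToTry =
          if (bufSize == 0) partsScanned * 4 // nothing found yet: quadruple the scanned range
          else math.min(
            math.max((1.5 * num * partsScanned / bufSize).toInt - partsScanned, 1), // overestimate by 50%
            partsScanned * 4) // but cap at 4x the partitions already scanned
      }
      val range = partsScanned until math.min(partsScanned + numPartsToTry, rowsPerPartition.length)
      range.foreach { p =>
        bufSize += math.min(rowsPerPartition(p), num - bufSize) // mirrors buf ++= _.take(num - buf.size)
      }
      partsScanned += numPartsToTry
      rounds += numPartsToTry
    }
    rounds.toSeq
  }
}
// e.g. 100 partitions holding 2 rows each and take(50): the rounds come out as Seq(1, 4, 20)
TakeEstimation.partitionsTriedPerRound(Array.fill(100)(2), 50)

So the scanned range widens progressively instead of launching a task per partition up front.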
b.lazy load by iterator
Diving into takeOrdered(n), some nice stuff shows up:
/**-similar to a search engine: assign the requested 'num' to each partition, then merge all the partitions' results;
 * that is why mapPartitions() is called here.
* Returns the first k (smallest) elements from this RDD as defined by the specified
* implicit Ordering[T] and maintains the ordering. This does the opposite of [[top]].
* For example:
* {{{
* sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1)
* // returns Array(2)
*
* sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2)
* // returns Array(2, 3)
* }}}
*
* @param num k, the number of elements to return
* @param ord the implicit ordering for T
* @return an array of top elements
*/
def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
  if (num == 0) {
    Array.empty
  } else {
    //-1 retrieve the top n items per partition
    val mapRDDs = mapPartitions { items =>
      // Priority keeps the largest elements, so let's reverse the ordering.
      //-ord.reverse is used so the queue trims its largest element, i.e. the element count per partition is bounded
      //  here; the final result is restored to small-to-large order at the very end
      val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
      queue ++= util.collection.Utils.takeOrdered(items, num)(ord) //-keep only the num smallest items of this partition
      Iterator.single(queue)
    }
    //-2 merge all the sub-results into the final n items
    if (mapRDDs.partitions.length == 0) {
      Array.empty
    } else {
      //-merge the individual partitions' sub-results
      mapRDDs.reduce { (queue1, queue2) =>
        queue1 ++= queue2
        queue1 //-always keep the left one to accumulate; the element count is bounded by the queue itself, see above
      }.toArray.sorted(ord) //-re-sort by the raw ord (the non-reversed order)
    }
  }
}
Note: items is an Iterator, which means it only costs a reference to the underlying storage rather than a concrete Array or Seq!
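The per-partition "bounded queue then merge" trick can be sketched outside Spark as well; BoundedPriorityQueue is a private Spark class, so the hypothetical sketch below mimics it with scala.collection.mutable.PriorityQueue from the standard library: each "partition" keeps at most num elements, evicting its current largest whenever a smaller one arrives, and merging two queues re-applies the same bound:

import scala.collection.mutable.PriorityQueue

//-hypothetical stand-in for a bounded priority queue: keep the `num` smallest items seen so far
def boundedTopK[T](items: Iterator[T], num: Int)(implicit ord: Ordering[T]): PriorityQueue[T] = {
  val queue = PriorityQueue.empty[T](ord) // head is the largest element under `ord`
  items.foreach { item =>
    if (queue.size < num) queue += item
    else if (ord.lt(item, queue.head)) { queue.dequeue(); queue += item } // evict the current largest
  }
  queue
}

//-the merge step, as in mapRDDs.reduce: fold one partition's queue into another, keeping the bound
def mergeTopK[T](q1: PriorityQueue[T], q2: PriorityQueue[T], num: Int)(implicit ord: Ordering[T]): PriorityQueue[T] = {
  q2.foreach { item =>
    if (q1.size < num) q1 += item
    else if (ord.lt(item, q1.head)) { q1.dequeue(); q1 += item }
  }
  q1
}

// e.g. two "partitions" of numbers, take the 3 smallest overall
val p1 = boundedTopK(Iterator(10, 4, 2), 3)
val p2 = boundedTopK(Iterator(12, 3, 5), 3)
mergeTopK(p1, p2, 3).toArray.sorted // Array(2, 3, 4)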
To see this more clearly, we can demonstrate a few snippets:
a.driver api
val maprdd = fmrdd.map((_,1)) //-MapPartitionsRDD[3]
b.rdd internal
/**
* Return a new RDD by applying a function to all elements of this RDD.
*/
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF)) //-so 'this' will be the parent rdd
}
c. then dive into iter.map()
def map[B](f: A => B): Iterator[B] = new AbstractIterator[B] {
  def hasNext = self.hasNext
  def next() = f(self.next())
}
So we know that every key-value pair is read in one per loop (through the callback function 'f()'),
i.e. the processing is nested: Fn(...F2(F1(read root RDD's kv pair 1))...), then Fn(...(kv pair 2)), and so on.
Also, since every RDD#iterator() (besides HadoopRDD's) produces a new Iterator (see above), no 'no more elements' exception will be raised by downstream RDDs' calls.
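The laziness is easy to see in plain Scala as well (a hypothetical demo, not Spark code): chained Iterator.map calls only wrap the upstream iterator, and a single element flows through the whole chain per next() call:

val source = Iterator(1, 2, 3) // plays the role of the root RDD's iterator
val f1 = source.map { x => println(s"f1($x)"); x + 1 }
val f2 = f1.map { x => println(s"f2($x)"); x * 10 }
// nothing has been printed yet: map only wrapped the upstream iterator
println(f2.next()) // prints f1(1), f2(2), then 20 -- one element flows through the whole chain
println(f2.next()) // prints f1(2), f2(3), then 30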