[spark-src-core] 5.big data techniques in spark

  There are several nice techniques in Spark, e.g. on the user API side. Here we will dive in and check how Spark implements them.

 

1.abstract(functions in RDD)

group 1
  first()
    feature: retrieves the first element of this RDD; if there is more than one partition, the first partition is taken by priority. In particular, it calls take(1) internally.
    principle: runs a job partition by partition until the total amount reaches the expected number.
  take(n)
    feature: extracts the first n elements of this RDD; it is the equivalent of first() when n is 1.
    principle: same as first().

group 2
  top(n)(order)
    feature: extracts the top (max by default) n elements. Calls takeOrdered(num)(ord.reverse) internally; search engines such as Solr use a similar technique to figure this out.
    principle: concurrently spawns all tasks to do the same operation on their respective partitions, i.e. each task will try to retrieve 'n' elements.
  max()(order)
    feature: retrieves the max element. Though it uses a different algorithm than top(n) internally, both end up with the same effect (performance?).
    principle: uses rdd.reduce(ord.max) internally.

group 3
  min()(order)
    feature: the opposite of max(); uses rdd.reduce(ord.min) internally.
    principle: similar to top(n).
  takeOrdered(n)(order)
    feature: the opposite of top(n); similar to min(), but takes the n minimum items.

group 4
  collect()
    feature: retrieves all the results of this RDD computation, so an OOM exception can occasionally occur.
    principle: similar to top(n), but here each task is not limited to n elements; it takes everything ('max').
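  To make the table concrete, here is a minimal driver sketch of these actions on a tiny RDD. The object name, the sample numbers and the partition count are made up for illustration; the output comments assume the default slicing of parallelize.

import org.apache.spark.{SparkConf, SparkContext}

object RddActionsDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-actions-demo").setMaster("local[2]"))
    val rdd = sc.parallelize(Seq(5, 1, 9, 3, 7), numSlices = 3)

    println(rdd.first())               // 5 : first element of the first partition
    println(rdd.take(2).toList)        // List(5, 1)
    println(rdd.top(2).toList)         // List(9, 7) : takeOrdered(2)(ord.reverse) internally
    println(rdd.max())                 // 9 : rdd.reduce(ord.max) internally
    println(rdd.min())                 // 1 : rdd.reduce(ord.min) internally
    println(rdd.takeOrdered(2).toList) // List(1, 3)
    println(rdd.collect().toList)      // whole dataset at the driver; may OOM on big RDDs

    sc.stop()
  }
}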

 

 

2.techniques

a.lazy computation & computing range by range

  E.g., in terms of take(n), Spark can act as a lazy worker: act only when needed! That is, Spark will try to use as few resources as possible. See below for details:

/**-estimates partitions step by step to decrease resource consumption, i.e. lazy computation.
   * Take the first num elements of the RDD. It works by first scanning one partition, and use the
   * results from that partition to estimate the number of additional partitions needed to satisfy
   * the limit.-loops (continued jobs) to estimate whether the results of the scanned partitions satisfy the target num.
   * the result returned is sorted by partition sequence.
   * @note due to complications in the internal implementation, this method will raise
   * an exception if called on an RDD of `Nothing` or `Null`.
   */
  def take(num: Int): Array[T] = withScope {
    if (num == 0) {
      new Array[T](0)
    } else {
      val buf = new ArrayBuffer[T]
      val totalParts = this.partitions.length
      var partsScanned = 0
      while (buf.size < num && partsScanned < totalParts) { //-loop to check whether the results satisfy the target
        //1 -compute which partition range to run
        // The number of partitions to try in this iteration. It is ok for this number to be
        // greater than totalParts because we actually cap it at totalParts in runJob.
        var numPartsToTry = 1
        if (partsScanned > 0) {
          log.info(s"-step to next loop,numPartsToTry=${numPartsToTry}")
          // If we didn't find any rows after the previous iteration, quadruple and retry.
          // Otherwise, interpolate the number of partitions we need to try, but overestimate
          // it by 50%. We also cap the estimation in the end.
          if (buf.size == 0) { //-no data in the previously scanned partitions, so range over more partitions
            numPartsToTry = partsScanned * 4
          } else {
            // the left side of max is >=1 whenever partsScanned >= 2
            //-estimate the remaining parts to compute; the estimated total parts = num / buf.size * partsScanned * 1.5
            numPartsToTry = Math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
            numPartsToTry = Math.min(numPartsToTry, partsScanned * 4) //-narrow down the partition range
          }
        }

        val left = num - buf.size
        //-step (range) to the next run
        val p = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
        //-2 proceed with scanning up to the remaining size in each of the specified partitions; similar to solr's group query
        val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true)
        //-3 add up into the total buf; note: each partition's result is trimmed to the current remaining size, since buf grows as results are appended
        res.foreach(buf ++= _.take(num - buf.size)) //-buf size changes per partition
        partsScanned += numPartsToTry
      }

      buf.toArray
    }
  }

   Again, Spark estimates the partitions to be computed from the amount of items scanned so far, i.e. numPartsToTry.

  Of course, this feature depends on Spark's ability to run a job over only a subset of the partitions.
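  To see the estimation loop in isolation, here is a standalone sketch (plain Scala, no SparkContext) that mimics the numPartsToTry logic above. The object name and partition contents are made up, and runJob is simulated by reading local arrays.

import scala.collection.mutable.ArrayBuffer

// Sketch of take(n)'s partition-range estimation: scan a range, estimate the next range, repeat.
object TakeEstimationSketch {
  def main(args: Array[String]): Unit = {
    val partitions: Array[Array[Int]] =
      Array(Array.empty[Int], Array.empty[Int], Array(1, 2), Array(3, 4, 5), Array(6, 7), Array(8))
    val num = 5

    val buf = new ArrayBuffer[Int]
    val totalParts = partitions.length
    var partsScanned = 0
    while (buf.size < num && partsScanned < totalParts) {
      var numPartsToTry = 1
      if (partsScanned > 0) {
        if (buf.isEmpty) {
          numPartsToTry = partsScanned * 4 // nothing found yet: quadruple the range
        } else {
          // interpolate how many partitions are still needed, overestimate by 50%, cap at 4x
          numPartsToTry = math.max((1.5 * num * partsScanned / buf.size).toInt - partsScanned, 1)
          numPartsToTry = math.min(numPartsToTry, partsScanned * 4)
        }
      }
      val left = num - buf.size
      val p = partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
      println(s"job over partitions $p, still need $left elements")
      // simulate runJob: each 'task' returns at most 'left' elements of its partition
      val res = p.map(i => partitions(i).take(left))
      res.foreach(buf ++= _.take(num - buf.size))
      partsScanned += numPartsToTry
    }
    println(buf.toList) // List(1, 2, 3, 4, 5)
  }
}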

 

b.lazy load by iterator

  By diving into takeOrdered(n), some nice stuff shows up here:

/**-similar to a search engine: assign the requested 'num' to each partition, then merge all partitions' results; that is why
   * mapPartitions() is called here.
   * Returns the first k (smallest) elements from this RDD as defined by the specified
   * implicit Ordering[T] and maintains the ordering. This does the opposite of [[top]].
   * For example:
   * {{{
   *   sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1)
   *   // returns Array(2)
   *
   *   sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2)
   *   // returns Array(2, 3)
   * }}}
   *
   * @param num k, the number of elements to return
   * @param ord the implicit ordering for T
   * @return an array of top elements
   */
  def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {
    if (num == 0) {
      Array.empty
    } else {
      //1 retrieve top n items per partition
      val mapRDDs = mapPartitions { items =>
        // Priority keeps the largest elements, so let's reverse the ordering.
        //-with the reversed ordering the bounded queue evicts the largest element first, so only the
        // 'num' smallest elements of this partition are kept
        val queue = new BoundedPriorityQueue[T](num)(ord.reverse)
        queue ++= util.collection.Utils.takeOrdered(items, num)(ord) //-take the 'num' smallest items to limit the count
        Iterator.single(queue)
      }
      //2 merge all the results into the final n items
      if (mapRDDs.partitions.length == 0) {
        Array.empty
      } else {
        //-merge the individual partitions' sub-results
        mapRDDs.reduce { (queue1, queue2) =>
          queue1 ++= queue2
          queue1 //-always keep the left one to accumulate; the element count stays bounded by the queue, see above
        }.toArray.sorted(ord) //-re-sort by the raw ord (undoing the reversed order)
      }
    }
  }

   Note: items is an Iterator, which means only a reference to the underlying storage is held, rather than a concrete Array or Seq!
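  The per-partition "bounded queue then merge" idea can be illustrated with plain Scala collections. BoundedPriorityQueue itself is private[spark], so a hand-bounded mutable.PriorityQueue stands in for it below; the object name and partition data are made up.

import scala.collection.mutable

// Sketch of takeOrdered(n): bound each partition to n candidates, then merge and sort.
object TakeOrderedSketch {
  // keep only the 'num' smallest elements seen so far
  def boundedAdd(queue: mutable.PriorityQueue[Int], x: Int, num: Int): Unit = {
    queue += x                             // max-heap: the head is the current largest element
    if (queue.size > num) queue.dequeue()  // evict the largest so at most 'num' smallest survive
  }

  def main(args: Array[String]): Unit = {
    val num = 3
    // pretend these are the iterators handed to each partition's task by mapPartitions
    val partitions = Seq(Iterator(10, 4, 2), Iterator(12, 3, 8), Iterator(7, 1, 6))

    // step 1: each 'task' reduces its partition to at most 'num' candidates
    val perPartition = partitions.map { items =>
      val q = mutable.PriorityQueue.empty[Int]
      items.foreach(boundedAdd(q, _, num))
      q
    }

    // step 2: merge the bounded queues (what rdd.reduce does across partitions), then sort
    val merged = perPartition.reduce { (q1, q2) => q2.foreach(boundedAdd(q1, _, num)); q1 }
    println(merged.toArray.sorted.toList) // List(1, 2, 3)
  }
}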

  To see this more clearly, we can demonstrate with some snippets:

  a.driver api

val maprdd = fmrdd.map((_,1)) //-MapPartitionsRDD[3]

  b.rdd internal

/**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF)) //-so 'this' becomes the parent rdd
  }

   c. then dive into iter.map()

def map[B](f: A => B): Iterator[B] = new AbstractIterator[B] {
    def hasNext = self.hasNext
    def next() = f(self.next())
 }

   So we know that every key-value pair is read one at a time per loop iteration (through the callback function 'f()').

   I.e. the embedded procedure is: Fn(...F2(F1(read root rdd's kv pair1))...), then Fn(...(kv pair2)), and so on.

 

   Also, since every RDD#iterator() (besides HadoopRDD's) will produce a new Iterator (see above), no 'no more elements' exception will be raised for downstream RDDs' calls.
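   A tiny plain-Scala snippet (not Spark source; the object name is made up) demonstrates this chaining: each element flows through the whole chain of map functions before the next element is read, and nothing runs until next() is called.

object IteratorChainingDemo {
  def main(args: Array[String]): Unit = {
    val source = Iterator("a", "b", "c")                                   // stands in for the root rdd's data
    val f1 = source.map { s => println(s"f1 reads $s"); (s, 1) }           // like fmrdd.map((_, 1))
    val f2 = f1.map { kv => println(s"f2 sees $kv"); kv._1.toUpperCase }   // a second chained map

    println("nothing has been read yet")  // both maps are lazy wrappers over 'source'
    println(f2.next())                    // prints: f1 reads a / f2 sees (a,1) / A
    println(f2.next())                    // prints: f1 reads b / f2 sees (b,1) / B
  }
}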

 

 
