One RDD Function a Day (1): map

weixin_33872660

Tags: big data, java, ui

Original link: https://my.oschina.net/hunglish/blog/1542495

Definition of map

In RDD, the map function is defined as follows:

def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }

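The [U: ClassTag] after the function name is a type parameter with a context bound, which places a compile-time requirement on the code. It does not require U itself to be a ClassTag object; it requires that an implicit ClassTag[U] be available wherever map is called. That ClassTag stores the runtime class of U, which would otherwise be lost to type erasure. Here is part of the ClassTag code: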

def apply[T](runtimeClass1: jClass[_]): ClassTag[T] =
    runtimeClass1 match {
      case java.lang.Byte.TYPE      => ClassTag.Byte.asInstanceOf[ClassTag[T]]
      case java.lang.Short.TYPE     => ClassTag.Short.asInstanceOf[ClassTag[T]]
      case java.lang.Character.TYPE => ClassTag.Char.asInstanceOf[ClassTag[T]]
      case java.lang.Integer.TYPE   => ClassTag.Int.asInstanceOf[ClassTag[T]]
      case java.lang.Long.TYPE      => ClassTag.Long.asInstanceOf[ClassTag[T]]
      case java.lang.Float.TYPE     => ClassTag.Float.asInstanceOf[ClassTag[T]]
      case java.lang.Double.TYPE    => ClassTag.Double.asInstanceOf[ClassTag[T]]
      case java.lang.Boolean.TYPE   => ClassTag.Boolean.asInstanceOf[ClassTag[T]]
      case java.lang.Void.TYPE      => ClassTag.Unit.asInstanceOf[ClassTag[T]]
      case ObjectTYPE               => ClassTag.Object.asInstanceOf[ClassTag[T]]
      case NothingTYPE              => ClassTag.Nothing.asInstanceOf[ClassTag[T]]
      case NullTYPE                 => ClassTag.Null.asInstanceOf[ClassTag[T]]
      case _                        => new ClassTag[T]{ def runtimeClass = runtimeClass1 }
    }

def unapply[T](ctag: ClassTag[T]): Option[Class[_]] = Some(ctag.runtimeClass)

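apply here is a factory method on the ClassTag companion object rather than a constructor. Given a runtime class, it returns the matching ClassTag[T]: the predefined tags for primitives, Object, Nothing, and Null, or otherwise a new ClassTag wrapping that class.

In the other direction, the extractor method unapply returns the runtime class held by a given ClassTag[T].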
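
To make the context bound concrete, here is a minimal sketch in plain Scala with no Spark dependency. The names ClassTagSketch, runtimeClassOf, and fill are made up for illustration and do not appear in Spark.

```scala
import scala.reflect.ClassTag

object ClassTagSketch {
  // [U: ClassTag] asks the compiler to supply an implicit ClassTag[U] at the call site.
  def runtimeClassOf[U: ClassTag]: Class[_] = implicitly[ClassTag[U]].runtimeClass

  // Building an Array[U] needs the runtime class of U, which erasure would otherwise hide.
  def fill[U: ClassTag](n: Int, value: U): Array[U] = Array.fill(n)(value)

  def main(args: Array[String]): Unit = {
    println(runtimeClassOf[Int])          // the primitive int class
    println(runtimeClassOf[String])       // java.lang.String
    println(fill(3, "a").mkString(","))   // a,a,a
  }
}
```

Without the ClassTag bound, fill would not compile, because Array.fill cannot create an array for an erased type.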

Next, let's look at how the withScope function is implemented.

/**
   * Execute a block of code in a scope such that all new RDDs created in this body will
   * be part of the same scope. For more detail, see {{org.apache.spark.rdd.RDDOperationScope}}.
   *
   * Note: Return statements are NOT allowed in the given body.
   */
private[spark] def withScope[U](body: => U): U = RDDOperationScope.withScope[U](sc)(body)

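First, private[spark] is Scala's syntax for qualifying an access modifier with a scope. Its general form is:

private[x] or protected[x]

where x names an enclosing package, class, or object. The member is private (or protected) up to x, meaning that code inside x can access it.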
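
A minimal sketch of the qualifier, using an enclosing object instead of a package so that it fits in one file. The names Outer, Counter, and peek are hypothetical and not from Spark.

```scala
object Outer {
  class Counter {
    // Visible anywhere inside Outer, but not outside it.
    private[Outer] var count: Int = 0

    def bump(): Unit = count += 1
  }

  // Allowed: this code is inside Outer, the scope named in private[Outer].
  def peek(c: Counter): Int = c.count
}

object AccessDemo {
  def main(args: Array[String]): Unit = {
    val c = new Outer.Counter
    c.bump()
    println(Outer.peek(c)) // 1
    // println(c.count)    // would not compile: count is private up to Outer
  }
}
```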

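For withScope, private[spark] means the method can be called by any code inside the org.apache.spark package, including its subpackages, but not from outside it. The purpose of this syntax is to allow more flexible access control than Java's traditional modifiers. Now let's jump to RDDOperationScope to see the actual implementation: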

/**
   * Execute the given body such that all RDDs created in this body will have the same scope.
   * The name of the scope will be the first method name in the stack trace that is not the
   * same as this method's.
   *
   * Note: Return statements are NOT allowed in body.
   */
  private[spark] def withScope[T](
      sc: SparkContext,
      allowNesting: Boolean = false)(body: => T): T = {
    val ourMethodName = "withScope"
    val callerMethodName = Thread.currentThread.getStackTrace()
      .dropWhile(_.getMethodName != ourMethodName)
      .find(_.getMethodName != ourMethodName)
      .map(_.getMethodName)
      .getOrElse {
        // Log a warning just in case, but this should almost certainly never happen
        logWarning("No valid method name for this RDD operation scope!")
        "N/A"
      }
    withScope[T](sc, callerMethodName, allowNesting, ignoreParent = false)(body)
  }

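The comment already tells us half the story. All RDDs created inside body share one scope, and the scope is named after the first method in the stack trace that is not withScope itself, which is the RDD operation that called it (map in our case). Note that withScope only plays a registration role: it records which operation created which RDDs, so that the DAG can be visualized.

The motivation is that the Spark UI showed only how stages executed, and the link between a stage and the RDDs in user code was not direct enough. With complex code, it was hard to tell from the UI how the code was actually running. To strengthen RDD visualization in the UI, every method that creates an RDD was wrapped so that RDDOperationScope records each RDD's operation history and relationships. So this outer withScope wrapper is not where the real work of map happens.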
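
The stack-trace lookup can be tried outside Spark. The sketch below copies the dropWhile/find chain from the source above into a hypothetical ScopeSketch object. Unlike Spark, it simply returns the caller's name alongside the result so that it can be inspected.

```scala
object ScopeSketch {
  // Simplified stand-in for RDDOperationScope.withScope: no SparkContext, no nesting rules.
  def withScope[T](body: => T): (String, T) = {
    val ourMethodName = "withScope"
    val callerMethodName = Thread.currentThread.getStackTrace
      .dropWhile(_.getMethodName != ourMethodName) // skip frames above withScope (getStackTrace)
      .find(_.getMethodName != ourMethodName)      // first frame that is not withScope itself
      .map(_.getMethodName)
      .getOrElse("N/A")
    (callerMethodName, body)
  }

  // Plays the role of an RDD operation that wraps its work in withScope.
  def map(x: Int): (String, Int) = withScope { x + 1 }

  def main(args: Array[String]): Unit =
    println(map(1)) // the scope name should be "map"
}
```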

sc.clean

/**
   * Clean a closure to make it ready to serialized and send to tasks
   * (removes unreferenced variables in $outer's, updates REPL variables)
   * If <tt>checkSerializable</tt> is set, <tt>clean</tt> will also proactively
   * check to see if <tt>f</tt> is serializable and throw a <tt>SparkException</tt>
   * if not.
   *
   * @param f the closure to clean
   * @param checkSerializable whether or not to immediately check <tt>f</tt> for serializability
   * @throws SparkException if <tt>checkSerializable</tt> is set but <tt>f</tt> is not
   *   serializable
   */
  private[spark] def clean[F <: AnyRef](f: F, checkSerializable: Boolean = true): F = {
    ClosureCleaner.clean(f, checkSerializable)
    f
  }

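The source comment is clear: before a closure is serialized and sent to tasks, it goes through this clean step, which removes unreferenced variables in its $outer references and updates REPL variables. When checkSerializable is set (the default), clean also checks up front that the closure is serializable and throws a SparkException if it is not. We don't need to understand the underlying implementation in depth for now; it is enough to know that this step exists.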
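
To get a feel for what "checking that a closure is serializable" means, here is a rough sketch that uses plain Java serialization as a stand-in. It is an assumption for illustration only, not Spark's ClosureCleaner logic, and the names SerializableCheckSketch, NotSer, isSerializable, and demo are hypothetical.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object SerializableCheckSketch {
  // A class that deliberately does not extend Serializable.
  class NotSer(val n: Int)

  // Try to serialize f into an in-memory buffer and report whether that worked.
  def isSerializable(f: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def demo(): (Boolean, Boolean) = {
    val i = 1
    val ok: Int => Int = x => x + i      // captures only an Int
    val ns = new NotSer(1)
    val bad: Int => Int = x => x + ns.n  // captures a non-serializable object
    (isSerializable(ok), isSerializable(bad))
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

The second closure fails the check because serializing it also means serializing the NotSer instance it captured.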

As an aside, here is a quick refresher on what a closure is:

An object is data with functions. A closure is a function with data

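The first sentence is easy to understand, but what does the second one mean? We don't need anything more complicated or theoretical; one example makes it clear.

Given a function f(x) = x + i, evaluate f(3) = 3 + i.

Analysis: to get the final value, you must know the value of i. Here i is called a free (open) variable, "open" as opposed to the "closed" in closure. If the surrounding context defines "int i = 1", then f(3) = 3 + 1 = 4. In other words, before the function value can be created it must capture the value of i. This capture can be thought of as "closing" the function, hence the name closure.

Drawing on various online sources, the meaning of a closure can be summarized as follows:

A closure is a function with state (private data that does not go away).
A closure is a function with memory.
A closure is equivalent to a compact object with only one method.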
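
The f(x) = x + i example and the "function with memory" point can both be written in a few lines of Scala. This is only an illustrative sketch; ClosureSketch, makeAdder, and makeCounter are made-up names.

```scala
object ClosureSketch {
  // f(x) = x + i: the returned function captures i from the enclosing call.
  def makeAdder(i: Int): Int => Int = x => x + i

  // A closure with memory: n lives on between calls, but only the closure can reach it.
  def makeCounter(): () => Int = {
    var n = 0
    () => { n += 1; n }
  }

  def main(args: Array[String]): Unit = {
    val f = makeAdder(1)
    println(f(3))   // 4
    val next = makeCounter()
    println(next()) // 1
    println(next()) // 2
  }
}
```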

MapPartitionsRDD

/**
 * An RDD that applies the provided function to every partition of the parent RDD.
 */
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false)
  extends RDD[U](prev) {

  override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))

  override def clearDependencies() {
    super.clearDependencies()
    prev = null
  }
}

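We can see that this class defines a compute method, which passes the first parent RDD's iterator for the given partition to f and returns the resulting iterator. I have not fully worked out the reasoning behind this design yet. Also, in the map source, does simply calling new on a class cause its methods to run? That would differ from how methods are invoked in traditional Java, and the class does not define an apply function either. To check, I wrote a simple test program: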

import scala.reflect.ClassTag

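/* Does new-ing a class trigger a method call? */
class testNewClass[U: ClassTag, V: ClassTag](val a: U, val b: V) {
  // Statements in the class body belong to the primary constructor
  println(s"${a.toString}, ${b.toString}")
  println("new really does execute the statements in the class body")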
}

object testNewClass{
  def main(args: Array[String]): Unit = {
    new testNewClass[String, String]("hello", "world")
  }
}

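This simple generic test does print its output as soon as the object is constructed. The reason is that statements written directly in a Scala class body form the primary constructor, so they run on new. That is not the same as calling a method: compute in MapPartitionsRDD is a method definition, so new MapPartitionsRDD(...) only builds the object, and compute runs later, when Spark evaluates a partition.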

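To summarize: map first cleans and checks the closure, removing unreferenced variables and updating REPL variables. It then constructs a MapPartitionsRDD which, when a partition is computed, takes the iterator of the first parent RDD for that partition and applies the function passed to map to each element.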
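
The summary can be illustrated with a toy model in plain Scala. This is only a sketch under simplifying assumptions (no TaskContext, no closure cleaning, no scheduler), and MiniRDD, SeqRDD, and MiniMapPartitionsRDD are hypothetical names, not Spark classes.

```scala
object MiniRddSketch {
  // Toy stand-in for RDD: data split into partitions, each read through an iterator.
  abstract class MiniRDD[T] {
    def numPartitions: Int
    def compute(index: Int): Iterator[T]

    // Like RDD.map: only builds a new object describing the work; nothing runs yet.
    def map[U](f: T => U): MiniRDD[U] =
      new MiniMapPartitionsRDD[U, T](this, (_, iter) => iter.map(f))

    // Stand-in for an action: this is what finally calls compute on every partition.
    def collect(): List[T] = (0 until numPartitions).flatMap(i => compute(i)).toList
  }

  class SeqRDD[T](parts: Vector[Vector[T]]) extends MiniRDD[T] {
    def numPartitions: Int = parts.length
    def compute(index: Int): Iterator[T] = parts(index).iterator
  }

  class MiniMapPartitionsRDD[U, T](prev: MiniRDD[T], f: (Int, Iterator[T]) => Iterator[U])
      extends MiniRDD[U] {
    def numPartitions: Int = prev.numPartitions
    // Apply f to the parent's iterator for this partition.
    def compute(index: Int): Iterator[U] = f(index, prev.compute(index))
  }

  def main(args: Array[String]): Unit = {
    var calls = 0
    val mapped = new SeqRDD(Vector(Vector(1, 2), Vector(3))).map { x => calls += 1; x * 10 }
    println(calls)            // 0: constructing the mapped dataset ran nothing
    println(mapped.collect()) // List(10, 20, 30)
    println(calls)            // 3: the function ran once per element during collect
  }
}
```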

Reposted from: https://my.oschina.net/hunglish/blog/1542495