RDD Source Code Analysis: ClosureCleaner

While reading the Spark source code recently, I noticed that many RDD transformations, such as map, flatMap, and filter, all contain the same line of code:

val cleanF = sc.clean(f)
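For context, this is roughly where that line sits in RDD.map (a simplified sketch based on the Spark 2.x source; the exact code differs a bit between versions):

def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  // clean the user function before it is shipped to the executors
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}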

At first I was puzzled: each transformation implements different logic, so why do they all share this one call? Following the call chain downward eventually leads to the ClosureCleaner.clean() method, so let's dig into clean:

private def clean(
    func: AnyRef,
    checkSerializable: Boolean,
    cleanTransitively: Boolean,
    accessedFields: Map[Class[_], Set[String]]): Unit = {

  if (!isClosure(func.getClass)) {
    logWarning("Expected a closure; got " + func.getClass.getName)
    return
  }

  // TODO: clean all inner closures first. This requires us to find the inner objects.
  // TODO: cache outerClasses / innerClasses / accessedFields

  if (func == null) {
    return
  }

  logDebug(s"+++ Cleaning closure $func (${func.getClass.getName}) +++")

  // A list of classes that represents closures enclosed in the given one
  val innerClasses = getInnerClosureClasses(func)

  // A list of enclosing objects and their respective classes, from innermost to outermost
  // An outer object at a given index is of type outer class at the same index
  // This recursively collects the class and instance of every enclosing closure plus the
  // outermost object. The test is whether func has a non-null $outer field: Scala synthesizes
  // a class for every function, and each one has an $outer field, but $outer is only non-null
  // when the function is actually a closure.
  val (outerClasses, outerObjects) = getOuterClassesAndObjects(func)

  // For logging purposes only
  val declaredFields = func.getClass.getDeclaredFields
  val declaredMethods = func.getClass.getDeclaredMethods

  logDebug(" + declared fields: " + declaredFields.size)
  declaredFields.foreach { f => logDebug(" " + f) }
  logDebug(" + declared methods: " + declaredMethods.size)
  declaredMethods.foreach { m => logDebug(" " + m) }
  logDebug(" + inner classes: " + innerClasses.size)
  innerClasses.foreach { c => logDebug(" " + c.getName) }
  logDebug(" + outer classes: " + outerClasses.size)
  outerClasses.foreach { c => logDebug(" " + c.getName) }
  logDebug(" + outer objects: " + outerObjects.size)
  outerObjects.foreach { o => logDebug(" " + o) }

  // Fail fast if we detect return statements in closures
  // This visits the class bytecode with the ASM framework (a textbook visitor pattern) and
  // checks whether the closure contains a return statement, which Spark does not allow.
  getClassReader(func.getClass).accept(new ReturnStatementFinder(), 0)
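  // As a quick illustration (hypothetical user code), a closure such as
  //   rdd.map { x => if (x < 0) return 0 else x * 2 }
  // would be rejected here at clean time with a ReturnStatementInClosureException,
  // rather than failing later on the executors.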

 

  // If accessed fields is not populated yet, we assume that
  // the closure we are trying to clean is the starting one
  if (accessedFields.isEmpty) {
    logDebug(s" + populating accessed fields because this is the starting closure")
    // Initialize accessed fields with the outer classes first
    // This step is needed to associate the fields to the correct classes later
    // Every outer object of func needs to record which of its fields are actually used.
    // Closures can be nested: in the SomethingNotSerializable example (reproduced after this
    // listing), the closure for scope "two" calls a method of scope "one", which in turn reads
    // a field of the enclosing SomethingNotSerializable object, so the fields that are really
    // referenced have to be searched for recursively in each object.
    for (cls <- outerClasses) {
      accessedFields(cls) = Set[String]()
    }
    // Populate accessed fields by visiting all fields and methods accessed by this and
    // all of its inner closures. If transitive cleaning is enabled, this may recursively
    // visits methods that belong to other classes in search of transitively referenced fields.
    for (cls <- func.getClass :: innerClasses) {
      getClassReader(cls).accept(new FieldAccessFinder(accessedFields, cleanTransitively), 0)
    }
  }

  logDebug(s" + fields accessed by starting closure: " + accessedFields.size)
  accessedFields.foreach { f => logDebug(" " + f) }

  // List of outer (class, object) pairs, ordered from outermost to innermost
  // Note that all outer objects but the outermost one (first one in this list) must be closures
  var outerPairs: List[(Class[_], AnyRef)] = (outerClasses zip outerObjects).reverse
  var parent: AnyRef = null
  if (outerPairs.size > 0 && !isClosure(outerPairs.head._1)) {
    // The outermost enclosing object is a plain class instance, not a closure.
    // The closure is ultimately nested inside a class; keep the object of that
    // class without cloning it since we don't want to clone the user's objects.
    // Note that we still need to keep around the outermost object itself because
    // we need it to clone its child closure later (see below).
    logDebug(s" + outermost object is not a closure, so do not clone it: ${outerPairs.head}")
    parent = outerPairs.head._2 // e.g. SparkContext
    outerPairs = outerPairs.tail
  } else if (outerPairs.size > 0) {
    logDebug(s" + outermost object is a closure, so we just keep it: ${outerPairs.head}")
  } else {
    logDebug(" + there are no enclosing objects!")
  }

  // Clone the closure objects themselves, nulling out any fields that are not
  // used in the closure we're working on or any of its inner closures.
  // Using the fields collected in accessedFields, each outer object is cloned and only the
  // fields that func actually references are copied over; everything else stays null in the
  // clone, which is what the "cleaning" amounts to.
  for ((cls, obj) <- outerPairs) {
    logDebug(s" + cloning the object $obj of class ${cls.getName}")
    // We null out these unused references by cloning each object and then filling in all
    // required fields from the original object. We need the parent here because the Java
    // language specification requires the first constructor parameter of any closure to be
    // its enclosing object.
    val clone = instantiateClass(cls, parent)
    for (fieldName <- accessedFields(cls)) {
      val field = cls.getDeclaredField(fieldName)
      field.setAccessible(true)
      val value = field.get(obj)
      field.set(clone, value)
    }
    // If transitive cleaning is enabled, we recursively clean any enclosing closure using
    // the already populated accessed fields map of the starting closure,
    // i.e. every enclosing closure is cleaned recursively as well.
    if (cleanTransitively && isClosure(clone.getClass)) {
      logDebug(s" + cleaning cloned closure $clone recursively (${cls.getName})")
      // No need to check serializable here for the outer closures because we're
      // only interested in the serializability of the starting closure
      clean(clone, checkSerializable = false, cleanTransitively, accessedFields)
    }
    parent = clone
  }

  // Update the parent pointer ($outer) of this closure
  // If parent is not null, point $outer at the (possibly cloned) parent
  if (parent != null) {
    val field = func.getClass.getDeclaredField("$outer")
    field.setAccessible(true)
    // If the starting closure doesn't actually need our enclosing object, then just null it out
    // i.e. if the accessed fields of func do not include $outer, the reference to the enclosing
    // object is set to null to avoid the extra serialization cost.
    if (accessedFields.contains(func.getClass) &&
        !accessedFields(func.getClass).contains("$outer")) {
      logDebug(s" + the starting closure doesn't actually need $parent, so we null it out")
      field.set(func, null)
    } else {
      // Update this closure's parent pointer to point to our enclosing object,
      // which could either be a cloned closure or the original user object
      field.set(func, parent)
    }
  }

  logDebug(s" +++ closure $func (${func.getClass.getName}) is now cleaned +++")

  // Verify that func can be serialized; if it cannot, fail fast with an exception.
  if (checkSerializable) {
    ensureSerializable(func)
  }
}

private def ensureSerializable(func: AnyRef) {
  try {
    if (SparkEnv.get != null) {
      SparkEnv.get.closureSerializer.newInstance().serialize(func)
    }
  } catch {
    case ex: Exception => throw new SparkException("Task not serializable", ex)
  }
}
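The nested-scope example mentioned in the comments above is, as far as I can reconstruct it, essentially the one from ClosureCleaner's own scaladoc (treat this as a sketch rather than an exact quote):

class SomethingNotSerializable {
  def someValue = 1
  def scope(name: String)(body: => Unit) = body
  def someMethod(): Unit = scope("one") {
    def x = someValue
    def y = 2
    scope("two") { println(y + 1) }
  }
}

Here the closure for scope "two" is not serializable on its own, because its $outer points at the scope "one" closure, which in turn points at the SomethingNotSerializable instance. But the body of scope "two" only uses the local y, so cleaning can clone scope "one", null out everything the clone doesn't need, and re-point scope "two"'s parent at that clone, breaking the transitive reference to SomethingNotSerializable.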

Reading through this was honestly a bit of a headache, but I pushed through. In the end, the job of the ClosureCleaner class is to recursively clean the unused fields out of a closure's enclosing objects, which reduces serialization overhead and prevents unnecessary not-serializable exceptions.
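And when a closure still cannot be serialized after cleaning, ensureSerializable makes the failure show up immediately on the driver. Here is a minimal standalone sketch of that same fail-fast idea, using plain Java serialization instead of Spark's closureSerializer (the name assertSerializable is mine, not Spark's):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Try to serialize the cleaned closure on the driver and fail immediately if it
// cannot be shipped to the executors, instead of failing later at task launch.
def assertSerializable(closure: AnyRef): Unit = {
  try {
    val oos = new ObjectOutputStream(new ByteArrayOutputStream())
    oos.writeObject(closure)
    oos.close()
  } catch {
    case e: Exception =>
      throw new RuntimeException("Task not serializable", e)
  }
}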

 

 

Reposted from: https://my.oschina.net/u/3233205/blog/1936757
