A First Look at Spark Closures

1. What is a closure

Roughly speaking, a closure is a function that can access variables defined outside itself. In Spark, however, modifications the function makes to such an external variable are not visible outside the function: each task mutates its own copy, as the rest of this article explains.
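
As a point of contrast, here is a minimal sketch of closure behavior in plain Scala, running in a single JVM with no Spark and no serialization involved (the names are illustrative):

object PlainClosure {
  def main(args: Array[String]): Unit = {
    var total = 0
    // This anonymous function is a closure: it captures `total` from
    // the enclosing scope and can both read and mutate it.
    val addToTotal = (x: Int) => total += x
    List(1, 2, 3).foreach(addToTotal)
    println(total) // 6: in a single JVM the captured variable is shared
  }
}

The rest of this article looks at what changes once such a function is serialized and shipped across a cluster.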

2. How the official Spark documentation describes closures

One of the harder things about Spark is understanding the scope and life cycle of variables and methods when executing code across a cluster. RDD operations that modify variables outside of their scope can be a frequent source of confusion. In the example below we’ll look at code that uses foreach() to increment a counter, but similar issues can occur for other operations as well.

Example

Consider the naive RDD element sum below, which may behave differently depending on whether execution is happening within the same JVM. A common example of this is when running Spark in local mode (--master = local[n]) versus deploying a Spark application to a cluster (e.g. via spark-submit to YARN):

var counter = 0
var rdd = sc.parallelize(data)

// Wrong: Don't do this!!
rdd.foreach(x => counter += x)

println("Counter value: " + counter)

Local vs. cluster modes

The behavior of the above code is undefined, and may not work as intended. To execute jobs, Spark breaks up the processing of RDD operations into tasks, each of which is executed by an executor. Prior to execution, Spark computes the task’s closure. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). This closure is serialized and sent to each executor.

The variables within the closure sent to each executor are now copies and thus, when counter is referenced within the foreach function, it’s no longer the counter on the driver node. There is still a counter in the memory of the driver node but this is no longer visible to the executors! The executors only see the copy from the serialized closure. Thus, the final value of counter will still be zero since all operations on counter were referencing the value within the serialized closure.

In local mode, in some circumstances, the foreach function will actually execute within the same JVM as the driver and will reference the same original counter, and may actually update it.

To ensure well-defined behavior in these sorts of scenarios one should use an Accumulator. Accumulators in Spark are used specifically to provide a mechanism for safely updating a variable when execution is split up across worker nodes in a cluster. The Accumulators section of this guide discusses these in more detail.

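A minimal sketch of that fix applied to the counter example above (sc and rdd as in the earlier snippet; the accumulator name is illustrative):

// The accumulator is created on the driver; executors only call add(),
// and only the driver reads the final value.
val accum = sc.longAccumulator("counter")
rdd.foreach(x => accum.add(x))
println("Counter value: " + accum.value)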

In general, closures - constructs like loops or locally defined methods, should not be used to mutate some global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside of closures. Some code that does this may work in local mode, but that’s just by accident and such code will not behave as expected in distributed mode. Use an Accumulator instead if some global aggregation is needed.

Printing elements of an RDD

Another common idiom is attempting to print out the elements of an RDD using rdd.foreach(println) or rdd.map(println). On a single machine, this will generate the expected output and print all the RDD’s elements. However, in cluster mode, the output to stdout being called by the executors is now writing to the executor’s stdout instead, not the one on the driver, so stdout on the driver won’t show these! To print all elements on the driver, one can use the collect() method to first bring the RDD to the driver node thus: rdd.collect().foreach(println). This can cause the driver to run out of memory, though, because collect() fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the take(): rdd.take(100).foreach(println).

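Collected in one place, the two driver-side options from the paragraph above (assuming an existing rdd):

// Brings the entire RDD to the driver first; can exhaust driver memory.
rdd.collect().foreach(println)
// Safer when a sample is enough: fetch only the first 100 elements.
rdd.take(100).foreach(println)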

3. How to make sense of this

RDD operations all take a user-defined closure function. If that function needs to access external variables, certain rules must be followed, or a runtime exception is thrown. When the closure is shipped to a node, it goes through the following steps:

  • On the driver, Spark finds, via reflection at runtime, all the variables the closure accesses, wraps them into an object, and serializes that object
  • The serialized object is sent over the network to the worker nodes
  • Each worker node deserializes the closure object
  • Each worker node executes the closure function

Note: modifications made to an external variable inside the closure are not propagated back to the driver.

In short, a function is shipped over the network and then executed. The variables it captures must therefore be serializable, or shipping fails; even when running locally, the same four steps above are still carried out.
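
A hypothetical sketch of that serializability rule (Holder and factor are illustrative names; sc and data are assumed as in the examples below): capturing a non-serializable object in a closure makes the job fail before any task runs.

class Holder(val factor: Int) // deliberately NOT Serializable

val holder = new Holder(10)
// Fails with org.apache.spark.SparkException: Task not serializable,
// because the closure captures the whole non-serializable `holder`:
// data.map(x => x * holder.factor).collect()

// Fix: capture only a serializable value instead of the enclosing object.
val factor = holder.factor
data.map(x => x * factor).collect() // fine: Int is serializable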

Broadcast variables can also make external data available to closures, but broadcasting everything clutters the code. Broadcasting was designed to cache larger, read-only data on each node so it is not shipped repeatedly, improving efficiency; it was not meant as a general mechanism for accessing external variables.
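
For completeness, a minimal broadcast sketch (the lookup table is illustrative): the value is shipped to each node once and is only read, never mutated, inside the closure.

val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))
data.map(x => lookup.value.getOrElse(x, "other")).foreach(println)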

4. Examples

package com.ruozedata.spark.core

import org.apache.spark.{SparkConf, SparkContext}

object UnderstandingClosures {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf()
    sparkConf.setAppName("UnderstandingClosures").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    val data = sc.parallelize(List(1, 2, 3, 4, 5, 6))
    var counter = 0
    // foreach is an action, so the closure actually runs (a bare map would
    // be lazy and never execute); each task mutates its own deserialized
    // copy of counter, not the driver's.
    data.foreach(x => counter += x)
    println(counter) // typically prints 0: the driver's counter is untouched
    sc.stop()
  }
}

The result shows the closure at work: the function passed to foreach is a closure, so counter never receives the accumulated value.

The correct way to write this is as follows:

package com.ruozedata.spark.core

import org.apache.spark.{SparkConf, SparkContext}

object UnderstandingClosures {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf()
    sparkConf.setAppName("UnderstandingClosures").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)

    val data = sc.parallelize(List(1, 2, 3, 4, 5, 6))
    // An accumulator is Spark's mechanism for safely updating a shared
    // variable from executors.
    val counter = sc.doubleAccumulator("acc")
    data.map { x =>
      counter.add(x) // caution: updates made inside a transformation may be
      x              // re-applied if a task is retried; only updates inside
    }.foreach(println) // actions are guaranteed to be applied exactly once
    println(counter.value)
    sc.stop()
  }
}

The result is 21.0: the accumulation is now performed correctly.
