Why an RDD Cannot Be Passed as a Broadcast Variable

Latest recommended article published on 2021-10-10 10:40:08

Svenran

Tags: spark

Article link: https://blog.csdn.net/weixin_43320509/article/details/105677000

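Today, while using Spark broadcast variables, I broadcast an RDD as the broadcast value. In local mode nothing failed and the results were even correct, but under yarn cluster the job kept throwing a NullPointerException. After some digging I found that an RDD cannot be broadcast this way. But then why did local mode not report an error? That question led me to look more closely at what an RDD is.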

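First, what is an RDD? A Resilient Distributed Dataset. That phrase was my first reaction.
It contains three keywords:
1. Resilient
2. Distributed
3. Dataset
So what does "resilient" mean? My own reading was:
1. The data inside can be reshaped freely
2. The elements can be of many types
3. It can be converted to and from other forms (array<->rdd rdd<->dataset)
Strictly speaking, an RDD is immutable: "editing" always means deriving a new RDD through a transformation. In Spark's own usage "resilient" refers to fault tolerance, since a lost partition can be recomputed from the recorded transformations.

Distributed:
An RDD is partitioned across the nodes of the cluster and can be operated on in parallel.

Dataset:
A collection of data elements.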

So does an RDD store data?
The official documentation says:
RDD Operations
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.

The sentence in this passage that matters most here is: **"Instead, they just remember the transformations applied to some base dataset (e.g. a file)."**
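
To make the lazy behaviour described in the quoted passage concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark code: `LazyDataset`, `load_numbers`, and `reads` are hypothetical names invented for illustration. The point is only that a transformation records a step, and the base data is read when an action is called.

```python
from typing import Callable, Iterable, List


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical): a data source plus recorded steps."""

    def __init__(self, source: Callable[[], Iterable[int]], steps=None):
        self._source = source            # how to obtain the base data, not the data itself
        self._steps = list(steps or [])  # recorded transformations

    def map(self, fn: Callable[[int], int]) -> "LazyDataset":
        # Transformation: record the function and return a new dataset; nothing runs yet.
        return LazyDataset(self._source, self._steps + [fn])

    def collect(self) -> List[int]:
        # Action: only now is the source read and every recorded step applied.
        out = []
        for item in self._source():
            for fn in self._steps:
                item = fn(item)
            out.append(item)
        return out


reads = []  # one entry is appended each time the base data is actually read


def load_numbers() -> List[int]:
    reads.append("read")
    return [1, 2, 3]


doubled = LazyDataset(load_numbers).map(lambda x: x * 2)
reads_before_action = len(reads)  # map only recorded a step, so nothing was read
result = doubled.collect()        # the action triggers the read and the computation
reads_after_action = len(reads)
```

In this sketch `doubled` holds no numbers at all until `collect()` runs, which mirrors the idea that an RDD is a description of a computation rather than a container of data.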

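From this passage we can see that an RDD does not store data once it is created; it only records how to compute it, and data is pulled only when an action runs.
So broadcasting an RDD ships only this description, not the data. That is why the job kept throwing a NullPointerException under yarn cluster: on the executors the deserialized RDD has no usable link back to the driver, so any operation on it fails. In local mode the driver and the tasks share one JVM, which is most likely why the same code appeared to work there.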
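
The difference between the two modes can be sketched with another toy model in plain Python. This is an illustration under stated assumptions, not Spark's implementation: `DriverContext`, `ToyRDD`, `to_wire`, and `from_wire` are hypothetical names, and the assumption being modelled is that an RDD handle keeps a reference to the driver-side context which is not shipped along when the handle is serialized.

```python
import json
from typing import List, Optional


class DriverContext:
    """Hypothetical stand-in for the driver-side context that actually runs jobs."""

    def collect(self, data: List[int]) -> List[int]:
        return list(data)


class ToyRDD:
    """Toy handle (hypothetical): a data description plus a link to the driver context."""

    def __init__(self, data: List[int], context: Optional[DriverContext] = None):
        self.data = data
        self.context = context

    def to_wire(self) -> str:
        # Mimic a transient field: the context reference is deliberately not serialized.
        return json.dumps({"data": self.data})

    @classmethod
    def from_wire(cls, payload: str) -> "ToyRDD":
        # The receiving side rebuilds the handle without any context.
        return cls(json.loads(payload)["data"], context=None)

    def collect(self) -> List[int]:
        # An action needs the context; with context=None this raises AttributeError.
        return self.context.collect(self.data)


rdd = ToyRDD([1, 2, 3], DriverContext())

# Same process, similar in spirit to local mode: the original object is used directly.
local_result = rdd.collect()

# Another process, similar in spirit to a cluster: the handle is serialized and rebuilt.
shipped = ToyRDD.from_wire(rdd.to_wire())
try:
    shipped.collect()
    remote_error = None
except AttributeError as exc:
    remote_error = type(exc).__name__  # Python's analogue of a null pointer error

# The workable pattern: materialize plain data on the driver first, then ship that.
shipped_values = json.loads(json.dumps(rdd.collect()))
```

In real Spark the corresponding fix is to collect the RDD on the driver first (for example with `collect()` or `collectAsMap()`) and broadcast the resulting plain collection, keeping in mind that the collected data has to fit in driver memory.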