Spark broadcast variables

Latest recommended article published on 2024-07-22 15:29:20

孩子加油孩子

https://blog.csdn.net/weixin_41804049/article/details/79903472

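My understanding: the Spark driver ships ordinary variables to every task. If such a variable is very large, this can cause out-of-memory errors. With a broadcast variable, the driver instead broadcasts the value to each executor, and each task reads the value it needs from its executor, which avoids the memory blow-up. (This understanding is incomplete: when several stages need the same data, that data should also be defined as a broadcast variable.)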
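To put rough numbers on the reasoning above, here is a small plain-Python model (no Spark involved). It assumes the simplified picture described in this post: one copy per task without broadcasting, one cached copy per executor with it. The function names and the sizes are hypothetical, chosen only for illustration.

```python
def copies_materialized(num_tasks: int, num_executors: int, use_broadcast: bool) -> int:
    """Count copies of the variable under the simplified model in this post."""
    # Broadcast: one cached copy per executor, shared by all tasks running there.
    # No broadcast: every task carries its own copy.
    return num_executors if use_broadcast else num_tasks


def total_size_mb(var_size_mb, num_tasks: int, num_executors: int, use_broadcast: bool):
    """Total size of all copies, in MB."""
    return var_size_mb * copies_materialized(num_tasks, num_executors, use_broadcast)


# Hypothetical job: a 100 MB lookup table, 1000 tasks, 10 executors.
plain = total_size_mb(100, num_tasks=1000, num_executors=10, use_broadcast=False)
shared = total_size_mb(100, num_tasks=1000, num_executors=10, use_broadcast=True)
print(plain, shared)  # 100000 1000
```

The gap grows with the number of tasks per executor, which is why the saving matters most for large variables and finely partitioned jobs.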

The official documentation explains:

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

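(1) When tasks span multiple stages and need the same data. (Stage boundaries are determined by shuffles, and a shuffle is produced by an RDD with a wide dependency. Wide dependency: a child partition's data is computed from multiple parent partitions, with each parent partition feeding several child partitions. Narrow dependency: each parent partition's data flows to only one child partition.)

(2) When caching the data in deserialized form is important.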
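Case (1) can be sketched in PySpark as follows. This is a minimal sketch, not code from the official docs: `totals_by_region`, `main`, and the sample data are hypothetical names, while `broadcast`, `parallelize`, `filter`, `reduceByKey`, `map`, and `collect` are the regular PySpark calls. The same broadcast dictionary is read before and after the shuffle introduced by `reduceByKey`, so both stages use one cached copy.

```python
def totals_by_region(sc, sales, region_names):
    """Sum amounts per region; both stages read the same broadcast dict.

    sc is a pyspark.SparkContext, sales is a list of (region_code, amount)
    pairs, region_names maps region_code -> display name.
    """
    names = sc.broadcast(region_names)  # created once on the driver

    pairs = (
        sc.parallelize(sales)
        # Stage 1: drop records whose region code is not in the table.
        .filter(lambda kv: kv[0] in names.value)
        # reduceByKey needs a shuffle, so a new stage starts after it.
        .reduceByKey(lambda a, b: a + b)
        # Stage 2: replace each region code with its display name.
        .map(lambda kv: (names.value[kv[0]], kv[1]))
        .collect()
    )
    return dict(pairs)


def main():
    # Needs a local PySpark installation; nothing runs until main() is called.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "broadcast-two-stages")
    try:
        sales = [("n", 10), ("s", 5), ("n", 7), ("x", 1)]
        print(totals_by_region(sc, sales, {"n": "North", "s": "South"}))
    finally:
        sc.stop()
```

Passing `region_names` directly into the lambdas would also work, but then the dictionary travels with the task data of each stage instead of being cached once.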

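Broadcast variables:

For example, a shared configuration table stored in a database that has to be synchronized to every node for lookups.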

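Broadcast variables allow the program to cache a read-only variable on each machine instead of keeping a copy for every task. For example, with broadcast variables we can give every node a copy of a large input dataset in a more efficient way. Spark also tries to distribute broadcast variables with efficient broadcast algorithms to reduce communication cost.

A broadcast variable is created from an initial variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed through the value method. The following code illustrates the process: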
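A minimal PySpark sketch of creating a broadcast variable and reading it through `value`, using the shared configuration table mentioned earlier as the data. It is not the original listing: `lookup_all`, `main`, and the sample config are hypothetical, and only `SparkContext.broadcast` and `Broadcast.value` are the API being demonstrated.

```python
def lookup_all(sc, config, keys):
    """Look up each key in a broadcast copy of `config`.

    sc is a pyspark.SparkContext, config is a plain dict, keys is a list.
    """
    # SparkContext.broadcast(v) returns a wrapper around v.
    config_bc = sc.broadcast(config)

    # Inside tasks, reach the data through the wrapper's .value attribute
    # rather than referring to `config` directly.
    return (
        sc.parallelize(keys)
        .map(lambda k: (k, config_bc.value.get(k)))
        .collect()
    )


def main():
    # Needs a local PySpark installation; nothing runs until main() is called.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "broadcast-demo")
    try:
        config = {"timeout": 30, "retries": 3}  # stand-in for the config table
        print(lookup_all(sc, config, ["timeout", "retries", "missing"]))
    finally:
        sc.stop()
```

Because the broadcast value is read-only, tasks should treat `config_bc.value` as a lookup table and never modify it.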

For a detailed illustrated walkthrough of the WordCount program, see https://blog.csdn.net/weixin_41804049/article/details/79903472

Broadcast variable diagram: