Spark Shared Variables

Spark's shared variables are described in the official programming guide: http://spark.apache.org/docs/1.6.3/programming-guide.html#shared-variables

  Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.  

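For instance, a variable captured in a closure is shipped to each task as a copy, so modifying it inside an action has no visible effect on the driver. A minimal Scala sketch of this pitfall, assuming an existing SparkContext named sc (as in the examples below):

var counter = 0
sc.parallelize(1 to 5).foreach(x => counter += x)  // each task increments its own deserialized copy
println(counter)  // on a cluster this still prints 0: updates made on the executors never reach the driver
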
I. Broadcast Variables

1. Description

   Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

  Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

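For example, when several jobs (or stages) need the same lookup table, broadcasting it once keeps a single cached copy per executor instead of re-shipping the table with every task closure. A minimal Scala sketch with an illustrative Map (the Map contents and variable names are illustrative; the full implementations follow in the next subsections):

val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b", 3 -> "c"))

// two separate jobs reuse the same executor-side cached copy of the Map
val first  = sc.parallelize(Seq(1, 2, 3)).map(k => lookup.value.getOrElse(k, "?")).collect()
val second = sc.parallelize(Seq(3, 2, 1)).map(k => lookup.value.getOrElse(k, "?")).collect()
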
2. Java Implementation

package com.lyl.it;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastTest {
	
	public static void main(String[] args) {
		SparkConf conf = new SparkConf().setAppName("Broadcast").setMaster("local");
		JavaSparkContext sc = new JavaSparkContext(conf);
		
		final int f = 3;
		final Broadcast<Integer> broadCastFactor = sc.broadcast(f);
		
		List<Integer> list = Arrays.asList(1,2,3,4,5);
		JavaRDD<Integer> listRDD = sc.parallelize(list);
		JavaRDD<Integer> result = listRDD.map(new Function<Integer, Integer>() {

			private static final long serialVersionUID = 1L;

			@Override
			public Integer call(Integer num) throws Exception {
//				return num * f;
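//				(using the local variable f would ship a copy of it inside every task closure;
//				 broadCastFactor.value() reads the copy that Spark has cached on the executor)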
				return num * broadCastFactor.value();
			}
		});
		
		result.foreach(new VoidFunction<Integer>() {
		
			private static final long serialVersionUID = 1L;

			@Override
			public void call(Integer num) throws Exception {
				System.out.println(num);
				
			}
		});
		
		sc.close();
		
	}

}

The output is as follows:

18/07/25 10:19:38 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
18/07/25 10:19:38 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2170 bytes)
18/07/25 10:19:39 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
3
6
9
12
15
18/07/25 10:19:39 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 915 bytes result sent to driver
18/07/25 10:19:39 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 544 ms on localhost (1/1)

3. Scala Implementation

package com.lyl.it

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object BroadcastTest {
  
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BroadcastTest").setMaster("local")
    val sc = new SparkContext(conf)

    val f = 3
    val broadCastFactor = sc.broadcast(f)
    val list = Array(1, 2, 3, 4, 5)

    sc.parallelize(list)
      .map(num => num * broadCastFactor.value)
      .foreach(num => println(num))

    sc.stop()
  }
}
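
Broadcast data stays cached on the executors in serialized form (see the description above), so a broadcast variable can be released explicitly once it is no longer needed. A minimal sketch, reusing the broadCastFactor from the example above:

broadCastFactor.unpersist()  // asynchronously remove the cached copies on the executors
// the variable remains usable afterwards; its value is simply re-broadcast the next time a task reads it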

II. Accumulators

1. Description

  Accumulators are variables that are only “added” to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types. If accumulators are created with a name, they will be displayed in Spark’s UI. This can be useful for understanding the progress of running stages (NOTE: this is not yet supported in Python).

  An accumulator is created from an initial value v by calling SparkContext.accumulator(v). Tasks running on the cluster can then add to it using the add method or the += operator (in Scala and Python). However, they cannot read its value. Only the driver program can read the accumulator’s value, using its value method.

2. Java Implementation

package com.lyl.it;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.Accumulator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

public class AccumulatorValueTest {
	
	public static void main(String[] args) {
      SparkConf conf = new SparkConf().setAppName("AccumulatorValue").setMaster("local");
      JavaSparkContext sc = new JavaSparkContext(conf);
      
      final Accumulator<Integer> sum = sc.accumulator(0,"Our Accumulator");
      
      List<Integer> list = Arrays.asList(1,2,3,4,5);
      
      JavaRDD<Integer> listRDD = sc.parallelize(list);
      listRDD.foreach(new VoidFunction<Integer>() {
		
		private static final long serialVersionUID = 1L;

		@Override
		public void call(Integer num) throws Exception {
			sum.add(num);
//			System.out.println(sum.value());
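//			(calling sum.value() inside a task would fail at runtime: only the driver can read an accumulator's value)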
		}
	});
      
      System.out.println(sum.value());
      
     try {
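		// sleep to keep the application (and its web UI, http://localhost:4040 by default) alive for inspection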
		Thread.sleep(60 * 1000 * 1000);
	} catch (InterruptedException e) {
		e.printStackTrace();
	}
     
      sc.close();
		
	}

}

The output is as follows:

18/07/25 09:49:37 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2170 bytes)
18/07/25 09:49:37 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/07/25 09:49:37 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 975 bytes result sent to driver
18/07/25 09:49:37 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 675 ms on localhost (1/1)
18/07/25 09:49:37 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
18/07/25 09:49:37 INFO DAGScheduler: ResultStage 0 (foreach at AccumulatorValueTest.java:23) finished in 1.038 s
18/07/25 09:49:37 INFO DAGScheduler: Job 0 finished: foreach at AccumulatorValueTest.java:23, took 3.212206 s
15
18/07/25 10:12:55 INFO BlockManagerInfo: Removed broadcast_0_piece0 on localhost:56469 in memory (size: 1335.0 B, free: 1121.6 MB)
18/07/25 10:12:55 INFO ContextCleaner: Cleaned accumulator 2

Because the accumulator was created with a name ("Our Accumulator"), it can also be seen in the Spark web UI while the application is running.

3. Scala Implementation

package com.lyl.it

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object AccumulatorValueTest {
      
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AccumulatorValueTest").setMaster("local")
    val sc = new SparkContext(conf)

    val sum = sc.accumulator(0, "Our Accumulator")

    val list = Array(1, 2, 3, 4, 5)

    sc.parallelize(list).foreach(num => sum.add(num))

    println(sum.value)

    sc.stop()
  }
}
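
The documentation quoted above also notes that programmers can add accumulator support for new types. A minimal Scala sketch using AccumulatorParam (the List[String] element type and all names here are illustrative, not part of the examples above):

import org.apache.spark.{AccumulatorParam, SparkConf, SparkContext}

// accumulator support for List[String]: "zero" is the empty list, "add" is concatenation
object StringListAccumulatorParam extends AccumulatorParam[List[String]] {
  def zero(initialValue: List[String]): List[String] = Nil
  def addInPlace(l1: List[String], l2: List[String]): List[String] = l1 ++ l2
}

object CustomAccumulatorTest {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CustomAccumulatorTest").setMaster("local")
    val sc = new SparkContext(conf)

    // pass the custom AccumulatorParam explicitly when creating the accumulator
    val errorLines = sc.accumulator(List.empty[String])(StringListAccumulatorParam)

    sc.parallelize(Array("ok", "ERROR: disk", "ok", "ERROR: network"))
      .foreach(line => if (line.contains("ERROR")) errorLines += List(line))

    // only the driver can read the accumulated value
    println(errorLines.value)

    sc.stop()
  }
}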

 
