I recently ran into a problem while doing stream processing with Spark Streaming: a filter function needed an external list of filter conditions, and that list is updated from time to time. At first I simply fetched the list in the main function, but I then noticed that a streaming job does not re-run main on every batch trigger. The code outside the RDD dependency chain of the Spark DAG (as I understand it, the part that runs on the driver) executes only once, when the streaming application starts, so the list was never refreshed. I then learned about broadcast variables and tried updating the broadcast inside foreachRDD. Here is the code:
private static volatile Broadcast<Set<String>> broadcast = null;

SparkConf conf = new SparkConf()
        .setMaster("local[*]")
        .setAppName("venus-monitor")
        .set("spark.shuffle.blockTransferService", "nio");
JavaSparkContext sc = new JavaSparkContext(conf);
sc.setLogLevel("WARN");
JavaStreamingContext jssc = new JavaStreamingContext(sc, new Duration(60000));
broadcast = sc.broadcast(getDomainSet());
// placeholder: in the real job this DStream is built from an input source
JavaDStream<String> computelog = null;
computelog.foreachRDD(new VoidFunction<JavaRDD<String>>() {
    @Override
    public void call(JavaRDD<String> stringJavaRDD) throws Exception {
        // read the current broadcast value
        broadcast.value();
        // release the old broadcast
        broadcast.unpersist();
        // re-broadcast to refresh the list (this captures sc -- see the error below)
        broadcast = sc.broadcast(getDomainSet());
    }
});
Running it fails with: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
16:27:00,236 ERROR [main] internal.Logging$class (Logging.scala:91) - Error starting the context, marking it as stopped
java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable
org.apache.spark.api.java.JavaSparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.api.java.JavaSparkContext, value: org.apache.spark.api.java.JavaSparkContext@3aa41da1)
- field (class: com.pingan.cdn.log.VenusMonitor$3, name: val$sc, type: class org.apache.spark.api.java.JavaSparkContext)
- object (class com.pingan.cdn.log.VenusMonitor$3, com.pingan.cdn.log.VenusMonitor$3@26586b74)
- field (class: org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1, name: foreachFunc$1, type: interface org.apache.spark.api.java.function.VoidFunction)
- object (class org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1, <function1>)
- field (class: org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, name: cleanedF$1, type: interface scala.Function1)
- object (class org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3, <function2>)
- writeObject data (class: org.apache.spark.streaming.dstream.DStream)
- object (class org.apache.spark.streaming.dstream.ForEachDStream, org.apache.spark.streaming.dstream.ForEachDStream@77a074b4)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 16)
- field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.spark.streaming.dstream.ForEachDStream@77a074b4))
- writeObject data (class: org.apache.spark.streaming.dstream.DStreamCheckpointData)
- object (class org.apache.spark.streaming.dstream.DStreamCheckpointData, [
0 checkpoint files
After investigation, the cause is that the JavaSparkContext sc captured by the anonymous function cannot be serialized. Replacing sc.broadcast(getDomainSet()) with a broadcast created from the SparkContext obtained via the RDD (stringJavaRDD.context()) solves the problem:
@Override
public void call(JavaRDD<String> stringJavaRDD) throws Exception {
    // read the current broadcast value
    broadcast.value();
    // release the old broadcast
    broadcast.unpersist();
    // re-broadcast via the SparkContext obtained from the RDD, so sc is not captured
    broadcast = stringJavaRDD.context().broadcast(getDomainSet(),
            ClassManifestFactory.classType(Set.class));
}
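The root cause is not Spark-specific: a serializable anonymous class stores every captured local variable in a synthetic field, so serializing the function drags the captured object along. A minimal sketch with plain java.io serialization (FakeContext and SerializableRunnable are stand-ins for JavaSparkContext and VoidFunction, not real Spark classes):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class CaptureDemo {
    // Stand-in for JavaSparkContext: deliberately NOT Serializable.
    static class FakeContext {}

    // A function type that must be serializable, like Spark's closures.
    interface SerializableRunnable extends Runnable, Serializable {}

    // Tries to serialize a closure that captures a FakeContext;
    // returns "ok" or the failure message.
    static String trySerialize() {
        final FakeContext ctx = new FakeContext();
        // The anonymous class captures 'ctx' in a synthetic field,
        // analogous to the 'val$sc' field in the stack trace above.
        SerializableRunnable f = new SerializableRunnable() {
            @Override
            public void run() {
                System.out.println(ctx);
            }
        };
        try {
            new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(f);
            return "ok";
        } catch (NotSerializableException e) {
            return "not serializable: " + e.getMessage();
        } catch (IOException e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        System.out.println(trySerialize());
    }
}
```

The synthetic field holding FakeContext plays the same role as the val$sc field in the Serialization stack above; fetching the context from the RDD inside call() removes the captured field entirely, which is why the fix works.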