Serialization problem in Spark Streaming: org.apache.spark.SparkException: Task not serializable

While using Spark Streaming to analyze data in real time, I ran into the following problem. The log output is:

org.apache.spark.SparkException: Task not serializable
	at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:393)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:926)
	at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:925)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
	at org.apache.spark.rdd.RDD.foreach(RDD.scala:925)
	at org.apache.spark.api.java.JavaRDDLike$class.foreach(JavaRDDLike.scala:351)
	at org.apache.spark.api.java.AbstractJavaRDDLike.foreach(JavaRDDLike.scala:45)
	at com.myspark.sparkanalysis.web.WebSocketServer.lambda$1(WebSocketServer.java:54)
	at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:272)
	at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:272)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
	at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
	at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:416)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
	at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
	at scala.util.Try$.apply(Try.scala:192)
	at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:257)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:257)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
	at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:256)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)

The cause of the error above is that JavaSparkContext and JavaStreamingContext cannot be serialized. The key Spark Streaming class in my code is shown below:

package com.myspark.sparkanalysis.service;

import java.io.Serializable;
import java.util.List;

import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class StreamingConfig implements Serializable, Runnable {

	private static final long serialVersionUID = 1L;

	// The key point: mark the field as transient so it is excluded from serialization
	private transient JavaSparkContext javaSparkContext;
	// The key point: mark the field as transient so it is excluded from serialization
	private transient JavaStreamingContext streamingContext = null;

	public StreamingConfig(@Autowired JavaSparkContext javaSparkContext) {
		this.javaSparkContext = javaSparkContext;
	}

	/**
	 * Start the streaming job.
	 * @param server
	 * @param listenerDirectory the directory to monitor
	 */
	public void startStreamTask(StreamingConsumer server, String listenerDirectory) {
		streamingContext = new JavaStreamingContext(javaSparkContext, Durations.seconds(20));
		JavaDStream<String> lines = streamingContext.textFileStream(listenerDirectory);

		lines.map(line -> line.split(",")[2])
				.foreachRDD(rdd -> {
					// do something....
					List<String> collect = rdd.collect();
					for (String d : collect) {
						server.sendMessageToCient(d);
					}
					//rdd.saveAsTextFile("");
				});

		streamingContext.start();
		try {
			streamingContext.awaitTermination();
			streamingContext.close();
		} catch (InterruptedException e) {
			e.printStackTrace();
		}
	}

	/**
	 * Stop the streaming job manually.
	 */
	public void destroyStreamTask() {
		if (streamingContext != null) {
			streamingContext.stop();
		}
	}

	@Override
	public void run() {
		//startStreamTask(StreamingConsumer server, String listenerDirectory)
	}
}
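For context, the JavaSparkContext injected into the constructor above has to be defined as a Spring bean somewhere. The original post does not show that configuration, so the following is only a minimal sketch assuming a Spring Boot style setup; the class name, master URL, and app name are all hypothetical:

```java
package com.myspark.sparkanalysis.config;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Hypothetical configuration class; the original post does not show how the
// JavaSparkContext bean is created. The master URL and app name are assumptions.
@Configuration
public class SparkConfig {

	@Bean(destroyMethod = "close")
	public JavaSparkContext javaSparkContext() {
		SparkConf conf = new SparkConf()
				.setAppName("spark-analysis")   // assumed app name
				.setMaster("local[*]");         // assumed master; adjust for a real cluster
		return new JavaSparkContext(conf);
	}
}
```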

After making these changes, the project ran correctly.

A related case: when Spark Streaming writes to Redis, the same `org.apache.spark.SparkException: Task not serializable` exception can appear. This is because in Spark, task closures must be serialized so they can be shipped across the cluster, and some objects, such as connection objects, cannot be serialized. To work around this, perform the Redis writes inside a `foreachRDD` operation: within `foreachRDD` we can process the RDD partition by partition and handle each record inside its partition, which avoids shipping a connection object created on the driver. Here is an example:

```python
import redis

# Write one partition to Redis. The connection is created inside the
# partition function, on the executor, so it never has to be serialized.
def write_partition_to_redis(partition):
    r = redis.Redis(host='localhost', port=6379)
    for key, value in partition:
        r.set(key, value)

# Called for each batch RDD on the driver; only the partition function
# (not a connection object) is shipped to the executors.
def write_to_redis(rdd):
    rdd.foreachPartition(write_partition_to_redis)

# Create the Spark Streaming context
ssc = ...

# Read the input stream
stream = ...

# Process the stream
processed_stream = ...

# Write the processed data to Redis
processed_stream.foreachRDD(write_to_redis)

# Start the Spark Streaming context
ssc.start()
ssc.awaitTermination()
```

In this example, the processed stream is handed to `foreachRDD`, which calls `write_to_redis` for every batch. `write_to_redis` itself only calls `rdd.foreachPartition`, and the actual Redis connection is created inside `write_partition_to_redis`, which runs on the executors. Each partition therefore uses its own connection object instead of sharing one created on the driver, so no connection ever needs to be serialized and the `Task not serializable` problem does not occur.
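Since the article's own code uses the Java API, the same per-partition pattern can be sketched in Java as well. This is only an illustration assuming the Jedis client; the host, port, and the way keys and values are extracted from each record are hypothetical:

```java
import org.apache.spark.streaming.api.java.JavaDStream;

import redis.clients.jedis.Jedis;

// Hedged sketch: write each RDD of "key,value" lines to Redis, creating the
// Jedis connection inside foreachPartition so it is never serialized.
public class RedisSink {

	public static void writeToRedis(JavaDStream<String> lines) {
		lines.foreachRDD(rdd ->
			rdd.foreachPartition(records -> {
				// One connection per partition, created on the executor (host/port assumed)
				try (Jedis jedis = new Jedis("localhost", 6379)) {
					while (records.hasNext()) {
						String[] parts = records.next().split(",");
						jedis.set(parts[0], parts[1]);
					}
				}
			})
		);
	}
}
```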
