Recently I needed to create a class instance on the driver side and then use it inside an RDD. At runtime I hit `Caused by: java.io.NotSerializableException`. The explanations I found online all said the same thing: the class does not extend Serializable, and a class that does not extend Serializable will not be serialized automatically. So I made my class extend Serializable, ran it again, and got exactly the same serialization exception, which left me completely puzzled. My call pattern is different from the ones covered in those posts, so none of their fixes solved my problem. Enough preamble; here is the problem:
The problem
Class ClassB has a field whose value is an instance of ClassA, and a method fun whose job is to call a method of ClassA inside an RDD. Instantiating ClassB and calling fun throws a serialization exception, even though both ClassA and ClassB extend Serializable. The code and the exception:
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by zhoujiamu on 2019/1/21.
  */
class ClassA extends Serializable {
  def getClassName: String = this.getClass.getSimpleName
}

class ClassB(sc: SparkContext) extends Serializable {
  val classA = new ClassA()
  def fun(): Unit = {
    val rdd = sc.makeRDD(1 to 5)
    rdd.map(i => "getClassName in ClassB -> " + classA.getClassName + s": $i")
      .collect.foreach(println)
  }
}

object SerializableTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("local[*]")
      .setAppName("SerializableTest")
      .set("spark.rdd.compress", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    lazy val sc = new SparkContext(conf)
    val classB = new ClassB(sc)
    classB.fun()
  }
}
The exception:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.map(RDD.scala:369)
at com.haizhi.test.ClassB.fun(SerializableTest.scala:18)
at com.haizhi.test.SerializableTest$.main(SerializableTest.scala:35)
at com.haizhi.test.SerializableTest.main(SerializableTest.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@433e536f)
- field (class: com.haizhi.test.ClassB, name: sc, type: class org.apache.spark.SparkContext)
- object (class com.haizhi.test.ClassB, com.haizhi.test.ClassB@667e34b1)
- field (class: com.haizhi.test.ClassB$$anonfun$fun$1, name: $outer, type: class com.haizhi.test.ClassB)
- object (class com.haizhi.test.ClassB$$anonfun$fun$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
... 12 more
I had already extended Serializable the way the online posts prescribe, yet the problem remained. So I took the opportunity to record my exploration and summarize the whole Spark serialization story here.
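In fact, the serialization stack above already names the culprit: it is not ClassB itself but its field sc (a SparkContext) that is not serializable. Extending Serializable only declares intent; Java serialization still walks every non-transient field of the object graph, and a single non-serializable field fails the whole graph. A minimal JDK-only sketch of this failure mode (no Spark involved; the class names are made up for illustration):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object FieldPitDemo {
  class NonSerializable                   // does NOT extend Serializable (plays the role of SparkContext)
  class Outer extends Serializable {      // declares Serializable...
    val inner = new NonSerializable       // ...but holds a non-serializable field
  }

  // Try to serialize an object graph; report success or failure
  def canSerialize(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    // Fails: serialization walks into `inner` and finds NonSerializable
    println(canSerialize(new Outer))  // false
  }
}
```

This is exactly why "just extend Serializable" was not enough for my ClassB: the closure sent to the executors pulled in the whole ClassB instance, sc included.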
Exploration & solutions
Below I analyze, one by one, the causes of the serialization exceptions I ran into, with the corresponding fixes and examples.
Serialization pit 1 (the common one online)
A class instantiated outside the RDD does not extend Serializable, and the instance is used inside the RDD:
class ClassA {
  def getClassName: String = this.getClass.getSimpleName
}

object SerializableTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("local[*]")
      .setAppName("SerializableTest")
      .set("spark.rdd.compress", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    lazy val sc = new SparkContext(conf)
    val classA = new ClassA()
    val rdd = sc.makeRDD(1 to 5)
    rdd.map(i => "getClassName in main -> " + classA.getClassName + s": $i")
      .collect().foreach(println)
  }
}
Fixes
Fix 1: make ClassA extend Serializable
class ClassA extends Serializable {
  def getClassName: String = this.getClass.getSimpleName
}
Fix 2: instantiate ClassA inside the RDD
rdd.map(i => {
  val classA = new ClassA
  "getClassName in main -> " + classA.getClassName + s": $i"
}).collect().foreach(println)
Fix 3: turn ClassA into an object (a singleton); it is instantiated automatically, and its method can be called directly inside the RDD
object ClassA {
  def getClassName: String = this.getClass.getSimpleName
}

object SerializableTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("local[*]")
      .setAppName("SerializableTest")
      .set("spark.rdd.compress", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    lazy val sc = new SparkContext(conf)
    val rdd = sc.makeRDD(1 to 5)
    rdd.map(i => "getClassName in main -> " + ClassA.getClassName + s": $i")
      .collect().foreach(println)
  }
}
Serialization pit 2
Calling, inside an RDD, a method of an object held as a field of another class throws a serialization exception:
class ClassA {
  def getClassName: String = this.getClass.getSimpleName
}

class ClassB(sc: SparkContext) extends Serializable {
  val classA = new ClassA()
  def fun(): Unit = {
    val rdd = sc.makeRDD(1 to 5)
    rdd.map(i => "getClassName in ClassB -> " + classA.getClassName + s": $i")
      .collect.foreach(println)
  }
}

object SerializableTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("local[*]")
      .setAppName("SerializableTest")
      .set("spark.rdd.compress", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    lazy val sc = new SparkContext(conf)
    val classB = new ClassB(sc)
    val rdd = sc.makeRDD(1 to 5)
    rdd.map(i => "getClassName in main -> " + classB.classA.getClassName + s": $i")
      .collect().foreach(println)
  }
}
As above, calling a method of ClassB's field classA inside the RDD throws a serialization exception.
Fixes
Fix 1: this ClassB design is rather ill-conceived; holding ClassA as a field is questionable. If the only goal is to call ClassA's methods, let ClassB extend ClassA instead:
class ClassA extends Serializable {
  def getClassName: String = this.getClass.getSimpleName
}

class ClassB(sc: SparkContext) extends ClassA with Serializable {
}

object SerializableTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("local[*]")
      .setAppName("SerializableTest")
      .set("spark.rdd.compress", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    lazy val sc = new SparkContext(conf)
    val classB = new ClassB(sc)
    val rdd = sc.makeRDD(1 to 5)
    rdd.map(i => "getClassName in main -> " + classB.getClassName + s": $i")
      .collect().foreach(println)
  }
}
Fix 2: outside the RDD, first copy classA out of ClassB into a local variable, then use that variable inside the RDD
object SerializableTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("local[*]")
      .setAppName("SerializableTest")
      .set("spark.rdd.compress", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    lazy val sc = new SparkContext(conf)
    val classB = new ClassB(sc)
    val a = classB.classA
    val rdd = sc.makeRDD(1 to 5)
    rdd.map(i => "getClassName in main -> " + a.getClassName + s": $i")
      .collect().foreach(println)
  }
}
This is similar to pit 1: the closure now captures only the local variable a (the ClassA instance itself) instead of the whole classB, so the non-serializable SparkContext never enters the serialized graph.
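Why does copying the field into a local variable help? A Scala closure that reads a field of the enclosing class captures the whole enclosing instance (the `$outer` you can see in the serialization stack above), whereas a closure that reads a local variable captures only that variable. A JDK-only sketch of the difference, with hypothetical names and no Spark involved:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object CaptureDemo {
  class NonSerializable                       // stand-in for SparkContext
  class Holder extends Serializable {
    val heavy = new NonSerializable           // non-serializable field, like `sc`
    val name  = "classA"                      // the value we actually need

    // Reads the field directly -> the closure captures `this` (the whole Holder)
    def badClosure: Int => String = i => name + i

    // Copies the field to a local first -> the closure captures only the String
    def goodClosure: Int => String = {
      val n = name
      i => n + i
    }
  }

  // Try to serialize an object graph; report success or failure
  def canSerialize(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val h = new Holder
    println(canSerialize(h.badClosure))   // fails: drags in Holder and its NonSerializable field
    println(canSerialize(h.goodClosure))  // succeeds: carries only the String
  }
}
```

This is the same mechanism Spark's ClosureCleaner is fighting when it reports "Task not serializable".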
Serialization pit 3
After all that rambling, this is the pit I actually fell into.
ClassB has a method fun and a field classA, and fun calls a method of classA inside the RDD:
class ClassA extends Serializable {
  def getClassName: String = this.getClass.getSimpleName
}

class ClassB(sc: SparkContext) extends Serializable {
  val classA = new ClassA()
  def fun(): Unit = {
    val rdd = sc.makeRDD(1 to 5)
    rdd.map(i => "getClassName in ClassB -> " + classA.getClassName + s": $i")
      .collect.foreach(println)
  }
}

object SerializableTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("local[*]")
      .setAppName("SerializableTest")
      .set("spark.rdd.compress", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    lazy val sc = new SparkContext(conf)
    val classB = new ClassB(sc)
    classB.fun()
  }
}
Fixes
Fix 1: do not use the field classA in fun; construct a fresh ClassA locally inside fun
def fun(): Unit = {
  val classA = new ClassA()
  val rdd = sc.makeRDD(1 to 5)
  rdd.map(i => "getClassName in ClassB -> " + classA.getClassName + s": $i")
    .collect.foreach(println)
}
This is similar to the fix for pit 1. But often ClassA is a fairly complete utility class that is used by more than just fun, so we want to keep it as a field of ClassB.
Fix 2: as before, copy the field into a local variable before the RDD inside fun
def fun(): Unit = {
  val a = classA
  val rdd = sc.makeRDD(1 to 5)
  rdd.map(i => "getClassName in ClassB -> " + a.getClassName + s": $i")
    .collect.foreach(println)
}
Fix 3: turn ClassB into an object
class ClassA extends Serializable {
  def getClassName: String = this.getClass.getSimpleName
}

object ClassB extends Serializable {
  val classA = new ClassA()
  def fun(sc: SparkContext): Unit = {
    val rdd = sc.makeRDD(1 to 5)
    rdd.map(i => "getClassName in ClassB -> " + classA.getClassName + s": $i")
      .collect.foreach(println)
  }
}

object SerializableTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true)
      .setMaster("local[*]")
      .setAppName("SerializableTest")
      .set("spark.rdd.compress", "true")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    lazy val sc = new SparkContext(conf)
    val classB = ClassB
    classB.fun(sc)
  }
}
Summary
The pit-filling above boils down to the following rules:
1. When a class instance created outside the RDD is used inside it, the class must extend Serializable.
2. In a non-static class (one declared with class), a field holding an object must not be used directly inside the RDD, even if that object's class extends Serializable, because referencing the field makes the closure capture the whole enclosing instance. Instead, assign the field to a local variable before the RDD and use that variable inside the RDD.
After all this rambling, the Spark serialization saga comes to an end. If anything is unclear, feel free to comment.