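1. Serialization is commonly used for network transfer and data persistence. Spark creates two serializers, each selected by its own configuration key and both defaulting to JavaSerializer: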
val serializer = instantiateClassFromConf[Serializer](
  "spark.serializer", "org.apache.spark.serializer.JavaSerializer")
logDebug(s"Using serializer: ${serializer.getClass}")
// This serializer is currently not used by the BlockManager
val closureSerializer = instantiateClassFromConf[Serializer](
  "spark.closure.serializer", "org.apache.spark.serializer.JavaSerializer")
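2. Two typical serialization scenarios in Spark
Scenario A: when an RDD operation such as map runs, cleanF is computed first. Internally this step analyzes the function f and then serializes it to verify that it can be serialized: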
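private def ensureSerializable(func: AnyRef) {
  try {
    if (SparkEnv.get != null) {
      // Serialize the cleaned function with the closure serializer
      SparkEnv.get.closureSerializer.newInstance().serialize(func)
    }
  } catch {
    case ex: Exception => throw new SparkException("Task not serializable", ex)
  }
}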
val closureSerializer = instantiateClassFromConf[Serializer](
  "spark.closure.serializer", "org.apache.spark.serializer.JavaSerializer")
Conclusion: the spark.closure.serializer setting determines how functions are serialized
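The check in scenario A can be approximated outside Spark. The following is a minimal sketch that assumes plain JDK serialization stands in for the closure serializer; `ClosureCheckSketch`, `isJavaSerializable`, and `Connection` are hypothetical names used only for illustration, not Spark API.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCheckSketch {
  // Hypothetical helper (not Spark API): tries to write `func` with plain
  // Java serialization and reports whether that succeeded.
  def isJavaSerializable(func: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try {
      out.writeObject(func)
      true
    } catch {
      case _: NotSerializableException => false
    } finally {
      out.close()
    }
  }

  // A plain class that does not extend Serializable.
  class Connection(val host: String)

  def main(args: Array[String]): Unit = {
    val factor = 3
    // Captures only an Int, so the function can be serialized.
    val ok: Int => Int = x => x * factor
    // Captures a Connection, which is not Serializable.
    val conn = new Connection("localhost")
    val bad: Int => Int = x => x + conn.host.length

    println(s"ok is serializable:  ${isJavaSerializable(ok)}")
    println(s"bad is serializable: ${isJavaSerializable(bad)}")
  }
}
```

A function that captures a non-serializable object fails this check, which is the situation Spark reports as "Task not serializable".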
Scenario B: in the BlockManager
val blockManager = new BlockManager(executorId, rpcEnv, blockManagerMaster,
  serializer, conf, memoryManager, mapOutputTracker, shuffleManager,
  blockTransferService, securityManager, numUsableCores)
val serializer = instantiateClassFromConf[Serializer](
  "spark.serializer", "org.apache.spark.serializer.JavaSerializer")
Conclusion: spark.serializer determines how the BlockManager serializes data
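As an illustration of how spark.serializer might be supplied from application code, here is a small configuration sketch. It assumes Spark is on the classpath; `SerializerConfSketch`, `Click`, and the application name are hypothetical, and registering classes with Kryo is optional.

```scala
import org.apache.spark.SparkConf

object SerializerConfSketch {
  // Hypothetical record type, used only to show class registration.
  case class Click(userId: Long, url: String)

  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("serializer-conf-sketch") // hypothetical application name
      // Overrides the JavaSerializer default read by instantiateClassFromConf.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Optional: registering classes lets Kryo avoid writing full class names.
      .registerKryoClasses(Array(classOf[Click]))
}
```

The same key can also be passed at submit time, for example with `--conf spark.serializer=org.apache.spark.serializer.KryoSerializer`.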
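3. Spark uses the Java serializer by default. I recommend switching to Kryo serialization for better performance. The two mechanisms are compared below.
A. First, the Java serialization mechanism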
override def serialize[T: ClassTag](t: T): ByteBuffer = {
  val bos = new ByteArrayOutputStream()
  val out = serializeStream(bos)
  out.writeObject(t)
  out.close()
  ByteBuffer.wrap(bos.toByteArray)
}
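To see the same pattern without Spark, the sketch below performs a round trip with the JDK alone. It assumes a plain `ObjectOutputStream` in place of Spark's `serializeStream`, so it is only an approximation of what JavaSerializer does; `JavaRoundTripSketch` and its methods are hypothetical names.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import java.nio.ByteBuffer

object JavaRoundTripSketch {
  // Same shape as the method above: write the object into an in-memory
  // buffer, close the stream, and wrap the resulting bytes.
  def serialize(value: AnyRef): ByteBuffer = {
    val bos = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bos)
    out.writeObject(value)
    out.close()
    ByteBuffer.wrap(bos.toByteArray)
  }

  // Reads the object back from the array that backs the buffer.
  def deserialize[T](bytes: ByteBuffer): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(
      bytes.array(), bytes.arrayOffset() + bytes.position(), bytes.remaining()))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val original = Array("spark", "kryo", "java")
    val bytes = serialize(original)
    val restored = deserialize[Array[String]](bytes)
    println(s"serialized size in bytes: ${bytes.remaining()}")
    println(s"round trip preserved the data: ${restored.sameElements(original)}")
  }
}
```

The serialized form includes a stream header and class descriptors, which is one reason Java serialization tends to produce larger output than Kryo.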
B. Next, the Kryo serialization mechanism
// Create two lazily initialized streams (the Kryo output and input)
private lazy val output = ks.newKryoOutput()
private lazy val input = new KryoInput()
override def serialize[T: ClassTag](t: T): ByteBuffer = {