Many objects in Spark must be serialized when they are transmitted over the network or written to the storage system. SparkEnv contains two serialization components, SerializerManager and closureSerializer. The code in SparkEnv that creates them is as follows:
val serializer = instantiateClassFromConf[Serializer](
"spark.serializer", "org.apache.spark.serializer.JavaSerializer")
logDebug(s"Using serializer: ${serializer.getClass}")
val serializerManager = new SerializerManager(serializer, conf)
val closureSerializer = new JavaSerializer(conf)
The serializer created here defaults to org.apache.spark.serializer.JavaSerializer; users can configure a different serialization implementation, such as org.apache.spark.serializer.KryoSerializer, through the spark.serializer property. The actual type of closureSerializer is fixed to org.apache.spark.serializer.JavaSerializer and cannot be overridden by users. JavaSerializer is implemented with the serialization API built into the Java language.
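For illustration, switching the data serializer to Kryo only requires setting the spark.serializer property mentioned above (a minimal sketch; the application name is arbitrary):

```scala
import org.apache.spark.SparkConf

// Use Kryo for data serialization; the closure serializer stays
// JavaSerializer regardless of this setting.
val conf = new SparkConf()
  .setAppName("serializer-demo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```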
1 Properties of SerializerManager
SerializerManager provides serialization, compression, and encryption services to the various Spark components. This section introduces the member properties of SerializerManager:
private[spark] class SerializerManager(defaultSerializer: Serializer, conf: SparkConf) {
private[this] val kryoSerializer = new KryoSerializer(conf)
private[this] val stringClassTag: ClassTag[String] = implicitly[ClassTag[String]]
private[this] val primitiveAndPrimitiveArrayClassTags: Set[ClassTag[_]] = {
val primitiveClassTags = Set[ClassTag[_]](
ClassTag.Boolean,
ClassTag.Byte,
ClassTag.Char,
ClassTag.Double,
ClassTag.Float,
ClassTag.Int,
ClassTag.Long,
ClassTag.Null,
ClassTag.Short
)
val arrayClassTags = primitiveClassTags.map(_.wrap)
primitiveClassTags ++ arrayClassTags
}
private[this] val compressBroadcast = conf.getBoolean("spark.broadcast.compress", true)
private[this] val compressShuffle = conf.getBoolean("spark.shuffle.compress", true)
private[this] val compressRdds = conf.getBoolean("spark.rdd.compress", false)
private[this] val compressShuffleSpill = conf.getBoolean("spark.shuffle.spill.compress", true)
private lazy val compressionCodec: CompressionCodec = CompressionCodec.createCodec(conf)
  // ...
}
- defaultSerializer: the default serializer, i.e. the serializer instantiated in the code above; its type is JavaSerializer
- conf: the SparkConf
- encryptionKey: the key used for encryption (not shown in the code excerpt above)
- kryoSerializer: another serializer provided by Spark, of type KryoSerializer, implemented with Google's Kryo serialization library
- stringClassTag: the class tag for strings, i.e. ClassTag[String]
- primitiveAndPrimitiveArrayClassTags: the set of class tags for primitive types and primitive arrays, including Boolean, Array[boolean], Int, Array[int], Long, Array[long], Byte, Array[byte], Null, Array[scala.runtime.Null$], Char, Array[char], Double, Array[double], Float, Array[float], Short, and Array[short]
- compressBroadcast: whether to compress broadcast objects; configurable through the spark.broadcast.compress property, true by default
- compressShuffle: whether to compress Shuffle output data; configurable through the spark.shuffle.compress property, true by default
- compressRdds: whether to compress RDDs; configurable through the spark.rdd.compress property, false by default
- compressShuffleSpill: whether to compress Shuffle data spilled to disk; configurable through the spark.shuffle.spill.compress property, true by default
- compressionCodec: the compression codec used by SerializerManager; its type is CompressionCodec.
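The array class tags in primitiveAndPrimitiveArrayClassTags are derived from the primitive tags with ClassTag#wrap, which lifts a ClassTag[T] to ClassTag[Array[T]]. A small stand-alone sketch:

```scala
import scala.reflect.ClassTag

// wrap turns the tag for Int into the tag for Array[Int],
// mirroring primitiveClassTags.map(_.wrap) above.
val intTag: ClassTag[Int] = ClassTag.Int
val intArrayTag: ClassTag[Array[Int]] = intTag.wrap

assert(intArrayTag.runtimeClass == classOf[Array[Int]])
```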
2 Creating the CompressionCodec
To save disk space, data needs to be compressed in some situations. The code that creates compressionCodec in SerializerManager is as follows:
private lazy val compressionCodec: CompressionCodec = CompressionCodec.createCodec(conf)
As shown, compressionCodec is marked with the lazy keyword for deferred initialization, i.e. it is initialized only when it is actually used. The createCodec method of CompressionCodec creates a CompressionCodec; its implementation is shown in the following listing:
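Scala's lazy modifier defers evaluation of the right-hand side until the first access, which is why no codec is instantiated unless compression is actually needed. A minimal stand-alone sketch of this behavior:

```scala
// The initializer runs only on first access, just like
// compressionCodec above.
var initialized = false
lazy val codecName: String = { initialized = true; "lz4" }

assert(!initialized) // nothing evaluated yet
codecName            // first access triggers initialization
assert(initialized)
```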
private val shortCompressionCodecNames = Map(
  "lz4" -> classOf[LZ4CompressionCodec].getName,
  "lzf" -> classOf[LZFCompressionCodec].getName,
  "snappy" -> classOf[SnappyCompressionCodec].getName)

def getCodecName(conf: SparkConf): String = {
  conf.get(configKey, DEFAULT_COMPRESSION_CODEC)
}

def createCodec(conf: SparkConf): CompressionCodec = {
  createCodec(conf, getCodecName(conf))
}

def createCodec(conf: SparkConf, codecName: String): CompressionCodec = {
  val codecClass = shortCompressionCodecNames.getOrElse(codecName.toLowerCase, codecName)
  val codec = try {
    val ctor = Utils.classForName(codecClass).getConstructor(classOf[SparkConf])
    Some(ctor.newInstance(conf).asInstanceOf[CompressionCodec])
  } catch {
    case e: ClassNotFoundException => None
    case e: IllegalArgumentException => None
  }
  codec.getOrElse(throw new IllegalArgumentException(s"Codec [$codecName] is not available. " +
    s"Consider setting $configKey=$FALLBACK_COMPRESSION_CODEC"))
}
According to the code above, createCodec first calls getCodecName to obtain the codec name. The variable configKey is spark.io.compression.codec, so this property can be used to configure the codec used for compression. If spark.io.compression.codec is not specified, the codec name defaults to lz4 (the value of the constant DEFAULT_COMPRESSION_CODEC). createCodec then calls its overload, which proceeds as follows:
- 1) Look up the codec's fully qualified class name in the shortCompressionCodecNames map by its short name
- 2) Instantiate the codec through Java reflection
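The two steps can be sketched outside Spark as follows. DemoCodec and the "demo" short name are hypothetical stand-ins; the real code resolves names such as lz4 against shortCompressionCodecNames and looks up a constructor taking a SparkConf:

```scala
// Hypothetical codec class standing in for, e.g., LZ4CompressionCodec.
class DemoCodec(val name: String)

object CodecFactory {
  val shortNames = Map("demo" -> classOf[DemoCodec].getName)

  def createCodec(codecName: String): DemoCodec = {
    // Step 1: resolve a short name to a fully qualified class name.
    val codecClass = shortNames.getOrElse(codecName.toLowerCase, codecName)
    // Step 2: instantiate the class reflectively.
    val ctor = Class.forName(codecClass).getConstructor(classOf[String])
    ctor.newInstance(codecName).asInstanceOf[DemoCodec]
  }
}
```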
3 Methods of SerializerManager
SerializerManager provides many methods for serialization, deserialization, compression, and encryption:
// Whether the given class tag ct can be serialized with kryoSerializer.
// canUseKryo returns true only when ct belongs to
// primitiveAndPrimitiveArrayClassTags or equals stringClassTag.
def canUseKryo(ct: ClassTag[_]): Boolean = {
  primitiveAndPrimitiveArrayClassTags.contains(ct) || ct == stringClassTag
}
// Get a serializer: kryoSerializer if canUseKryo(ct) is true,
// defaultSerializer otherwise.
def getSerializer(ct: ClassTag[_]): Serializer = {
  if (canUseKryo(ct)) {
    kryoSerializer
  } else {
    defaultSerializer
  }
}
// Get a serializer based on both the key and value class tags.
def getSerializer(keyClassTag: ClassTag[_], valueClassTag: ClassTag[_]): Serializer = {
  if (canUseKryo(keyClassTag) && canUseKryo(valueClassTag)) {
    kryoSerializer
  } else {
    defaultSerializer
  }
}
// Whether the block identified by blockId should be compressed,
// based on the compress* properties introduced above.
private def shouldCompress(blockId: BlockId): Boolean = {
  blockId match {
    case _: ShuffleBlockId => compressShuffle
    case _: BroadcastBlockId => compressBroadcast
    case _: RDDBlockId => compressRdds
    case _: TempLocalBlockId => compressShuffleSpill
    case _: TempShuffleBlockId => compressShuffle
    case _ => false
  }
}
// Wrap an output stream for compression if the block should be compressed.
def wrapForCompression(blockId: BlockId, s: OutputStream): OutputStream = {
  if (shouldCompress(blockId)) compressionCodec.compressedOutputStream(s) else s
}
// Wrap an input stream for compression if the block should be compressed.
def wrapForCompression(blockId: BlockId, s: InputStream): InputStream = {
  if (shouldCompress(blockId)) compressionCodec.compressedInputStream(s) else s
}
// Serialize values into the given output stream for the block.
def dataSerializeStream[T: ClassTag](
    blockId: BlockId,
    outputStream: OutputStream,
    values: Iterator[T]): Unit = {
  val byteStream = new BufferedOutputStream(outputStream)
  val ser = getSerializer(implicitly[ClassTag[T]]).newInstance()
  ser.serializeStream(wrapForCompression(blockId, byteStream)).writeAll(values).close()
}
// Serialize values into a chunked byte buffer.
def dataSerialize[T: ClassTag](blockId: BlockId, values: Iterator[T]): ChunkedByteBuffer = {
  dataSerializeWithExplicitClassTag(blockId, values, implicitly[ClassTag[T]])
}
// Serialize values into a chunked byte buffer using an explicit class tag.
def dataSerializeWithExplicitClassTag(
    blockId: BlockId,
    values: Iterator[_],
    classTag: ClassTag[_]): ChunkedByteBuffer = {
  val bbos = new ChunkedByteBufferOutputStream(1024 * 1024 * 4, ByteBuffer.allocate)
  val byteStream = new BufferedOutputStream(bbos)
  val ser = getSerializer(classTag).newInstance()
  ser.serializeStream(wrapForCompression(blockId, byteStream)).writeAll(values).close()
  bbos.toChunkedByteBuffer
}
// Deserialize an input stream into an iterator of values.
def dataDeserializeStream[T](
    blockId: BlockId,
    inputStream: InputStream)
    (classTag: ClassTag[T]): Iterator[T] = {
  val stream = new BufferedInputStream(inputStream)
  getSerializer(classTag)
    .newInstance()
    .deserializeStream(wrapForCompression(blockId, stream))
    .asIterator.asInstanceOf[Iterator[T]]
}
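The selection logic of canUseKryo and getSerializer can be exercised stand-alone by reproducing the tag set (a sketch mirroring the methods above; ClassTag equality is based on the runtime class, which is what makes the Set lookup work):

```scala
import scala.reflect.ClassTag

// Mirror of canUseKryo: Kryo handles primitives, primitive arrays,
// and String; everything else falls back to the default serializer.
val primitiveTags = Set[ClassTag[_]](
  ClassTag.Boolean, ClassTag.Byte, ClassTag.Char, ClassTag.Double,
  ClassTag.Float, ClassTag.Int, ClassTag.Long, ClassTag.Null, ClassTag.Short)
val kryoTags = primitiveTags ++ primitiveTags.map(_.wrap)

def canUseKryo(ct: ClassTag[_]): Boolean =
  kryoTags.contains(ct) || ct == implicitly[ClassTag[String]]

assert(canUseKryo(ClassTag.Int))
assert(canUseKryo(implicitly[ClassTag[Array[Int]]]))
assert(canUseKryo(implicitly[ClassTag[String]]))
assert(!canUseKryo(implicitly[ClassTag[List[Int]]]))
```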