Spark Source Study: Data Serialization

Notes on the source material

Data Serialization

Serialization plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or consume a large number of bytes, will greatly slow down the computation. Often, this will be the first thing you should tune to optimize a Spark application. Spark aims to strike a balance between convenience (allowing you to work with any Java type in your operations) and performance. It provides two serialization libraries:

1. Java serialization:

By default, Spark serializes objects using Java’s ObjectOutputStream framework, and can work with any class you create that implements java.io.Serializable. You can also control the performance of your serialization more closely by extending java.io.Externalizable. Java serialization is flexible but often quite slow, and leads to large serialized formats for many classes.

(Summary: Java serialization is Spark's default serialization mechanism. It is flexible, but often quite slow.)

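As a standalone illustration of the mechanism described above (plain JVM code, not Spark-specific), the sketch below serializes a small class with Java's `ObjectOutputStream` and reads it back; the `Point` type and its fields are made up for the example. Note how the serialized form is several times larger than the 16 bytes of actual payload, which is exactly the overhead the paragraph above refers to.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream, Serializable}

// A hypothetical record type; implementing java.io.Serializable is all
// that Java serialization requires.
case class Point(x: Double, y: Double) extends Serializable

object JavaSerDemo {
  // Serialize any Serializable object to raw bytes with ObjectOutputStream,
  // the same framework Spark uses by default.
  def toBytes(obj: Serializable): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    try oos.writeObject(obj) finally oos.close()
    bos.toByteArray
  }

  // Deserialize the bytes back into an object.
  def fromBytes(bytes: Array[Byte]): AnyRef = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try ois.readObject() finally ois.close()
  }

  def main(args: Array[String]): Unit = {
    val bytes = toBytes(Point(1.0, 2.0))
    // The payload is two doubles (16 bytes), but the Java format also
    // writes class metadata, so the serialized form is much larger.
    println(s"serialized size: ${bytes.length} bytes")
    println(fromBytes(bytes) == Point(1.0, 2.0))
  }
}
```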
2. Kryo serialization:

Spark can also use the Kryo library (version 4) to serialize objects more quickly. Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but does not support all Serializable types and requires you to register the classes you’ll use in the program in advance for best performance.

(Summary: Kryo serialization is much faster and more compact, but less flexible: it does not support every Serializable type, and classes should be registered in advance for best performance.)

You can switch to using Kryo by initializing your job with a SparkConf and calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). This setting configures the serializer used not only for shuffling data between worker nodes but also when serializing RDDs to disk. The only reason Kryo is not the default is because of the custom registration requirement, but we recommend trying it in any network-intensive application. Since Spark 2.0.0, we internally use Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or string type.

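A minimal sketch of the switch described above, assuming Spark is on the classpath; the master URL and application name here are placeholders for the example:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Switch the serializer used for shuffles and serialized RDDs to Kryo.
val conf = new SparkConf()
  .setMaster("local[*]")    // placeholder master for the example
  .setAppName("KryoExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
```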

To register your own custom classes with Kryo, use the registerKryoClasses method.

val conf = new SparkConf().setMaster(...).setAppName(...)
// Register the application's classes with Kryo up front for best performance.
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
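When Kryo meets a class that was not registered, it falls back to writing the full class name with each object, which wastes space silently. Spark exposes a strictness flag so that missing registrations fail fast during development instead; this is a configuration fragment that continues the SparkConf setup above:

```scala
// Throw an error whenever an unregistered class would be serialized,
// so missing registrations are caught early rather than hidden as overhead.
conf.set("spark.kryo.registrationRequired", "true")
```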

Summary:

1. Spark provides two serialization options: Java serialization and Kryo serialization.
2. Java serialization is Spark's default; it is flexible, but slow.
3. Kryo serialization is much faster, but less flexible, and classes should be registered in advance.
