Spark-Core Performance Tuning: Using Kryo Serialization in Spark

晓晓很可爱

Category: spark-core    Tags: spark

Article link: https://blog.csdn.net/Fresh_man888/article/details/110672281

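1. Why use Kryo serialization:

To reduce the network transfer pressure between the driver and the executors, you can use the Kryo serializer that ships with Spark instead of Java serialization. Kryo output is more compact, so it saves memory and also uses less network bandwidth. Java serialization produces larger output and is correspondingly slower to send over the network.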
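
To make the size argument concrete, here is a minimal sketch that uses only the JDK to measure how many bytes Java serialization needs for a tiny object. The `Point` class and the `JavaSerializationSize` object are hypothetical names introduced for this illustration; they are not part of the original example.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical record type used only for this illustration.
case class Point(x: Int, y: Int)

object JavaSerializationSize {
  // Serialize a value with Java's ObjectOutputStream and return the raw bytes.
  def javaSerialize(value: AnyRef): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    buffer.toByteArray
  }

  def main(args: Array[String]): Unit = {
    val bytes = javaSerialize(Point(1, 2))
    // The two Int fields carry 8 bytes of data; everything beyond that is
    // stream header and class metadata written by Java serialization.
    println(s"field data: 8 bytes, Java-serialized: ${bytes.length} bytes")
  }
}
```

The gap between the field data and the serialized size is the per-object overhead that a more compact serializer such as Kryo aims to reduce.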

2. Serialization tuning as described in the official documentation

Data Serialization

Serialization plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or consume a large number of bytes, will greatly slow down the computation. Often, this will be the first thing you should tune to optimize a Spark application. Spark aims to strike a balance between convenience (allowing you to work with any Java type in your operations) and performance. It provides two serialization libraries:

Java serialization: By default, Spark serializes objects using Java’s ObjectOutputStream framework, and can work with any class you create that implements java.io.Serializable. You can also control the performance of your serialization more closely by extending java.io.Externalizable. Java serialization is flexible but often quite slow, and leads to large serialized formats for many classes.
Kryo serialization: Spark can also use the Kryo library (version 4) to serialize objects more quickly. Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but does not support all Serializable types and requires you to register the classes you’ll use in the program in advance for best performance.
You can switch to using Kryo by initializing your job with a SparkConf and calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). This setting configures the serializer used for not only shuffling data between worker nodes but also when serializing RDDs to disk. The only reason Kryo is not the default is because of the custom registration requirement, but we recommend trying it in any network-intensive application. Since Spark 2.0.0, we internally use Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or string type.

Spark automatically includes Kryo serializers for the many commonly-used core Scala classes covered in the AllScalaRegistrar from the Twitter chill library.

To register your own custom classes with Kryo, use the registerKryoClasses method.

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)

The Kryo documentation describes more advanced registration options, such as adding custom serialization code.
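
One of the more advanced registration options can be sketched with Spark's `KryoRegistrator` trait and the `spark.kryo.registrator` setting. This is only a sketch: the bodies of `MyClass1` and `MyClass2`, and the names `MyKryoRegistrator` and `RegistratorSketch`, are hypothetical and exist only so the example compiles.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical application classes, defined here only so the sketch compiles.
case class MyClass1(id: Int)
case class MyClass2(label: String)

// Spark instantiates the registrator by class name, so it needs a no-argument constructor.
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyClass1])
    kryo.register(classOf[MyClass2])
  }
}

object RegistratorSketch {
  // Build a configuration that enables Kryo and points Spark at the registrator.
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("kryo-registrator-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[MyKryoRegistrator].getName)
}
```

Inside `registerClasses` you have the `Kryo` instance itself, so the two-argument `kryo.register(Class, Serializer)` overload can be used there to attach a custom serializer to a class.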

If your objects are large, you may also need to increase the spark.kryoserializer.buffer config. This value needs to be large enough to hold the largest object you will serialize.
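
A minimal sketch of raising the buffer settings follows. Besides `spark.kryoserializer.buffer`, Spark also exposes `spark.kryoserializer.buffer.max`, the ceiling the buffer is allowed to grow to. The sizes below are illustrative assumptions, not recommendations, and `KryoBufferSketch` is a hypothetical name.

```scala
import org.apache.spark.SparkConf

object KryoBufferSketch {
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("kryo-buffer-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Initial size of each Kryo serialization buffer (illustrative value).
      .set("spark.kryoserializer.buffer", "1m")
      // Maximum size the buffer may grow to; it has to fit the largest
      // object you serialize (illustrative value).
      .set("spark.kryoserializer.buffer.max", "256m")
}
```

If a job fails with a Kryo buffer overflow error, the maximum is usually the setting to raise first.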

Finally, if you don’t register your custom classes, Kryo will still work, but it will have to store the full class name with each object, which is wasteful.
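
To catch forgotten registrations instead of silently paying for full class names, Spark has a `spark.kryo.registrationRequired` setting that makes Kryo throw an error when it meets an unregistered class. The sketch below assumes a hypothetical `Event` class and a hypothetical `StrictRegistrationSketch` object.

```scala
import org.apache.spark.SparkConf

// Hypothetical application class used only for this illustration.
case class Event(id: Long, kind: String)

object StrictRegistrationSketch {
  def buildConf(): SparkConf =
    new SparkConf()
      .setAppName("kryo-strict-registration-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Fail fast on any class that was not registered (the default is false).
      .set("spark.kryo.registrationRequired", "true")
      // Register the application classes that will be serialized.
      .registerKryoClasses(Array(classOf[Event]))
}
```

With this setting enabled, a job may also report array types or library classes that then need registering, so it is probably best tried in a test environment first.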

3. Code implementation:

A simple custom sort serves as the example:

package com.zxx.spark.day08

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import com.alibaba.fastjson.JSON
import com.zxx.spark.pojo.Person

/**
 * Custom sorting, first approach
 */
object CustomSort3 {
  def main(args: Array[String]): Unit = {
    // Build the SparkConf first and set Kryo as the serializer
    val conf: SparkConf = new SparkConf()
      .setAppName(this.getClass.getSimpleName)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Register the classes that will be serialized
    conf.registerKryoClasses(Array(classOf[Person]))
    // The first argument decides whether to run locally; a missing argument is treated as false
    val runLocal = args.headOption.exists(_.toBoolean)
    if (runLocal) {
      conf.setMaster("local[*]")
    }
    val sc: SparkContext = new SparkContext(conf)
    // Create an RDD of JSON strings
    val lines: RDD[String] = sc.parallelize(List("{\"name\":\"zxx\",\n\"age\":18,\n\"fv\":100.0}", "{\"name\":\"dhh\",\n\"age\":28,\n\"fv\":99.0}", "{\"name\":\"jd\",\n\"age\":28,\n\"fv\":98.0}"))
    // Parse each JSON string into a Person bean. An RDD only describes the data;
    // the parsing logic itself runs on the executors
    val per: RDD[Person] = lines.map(e => JSON.parseObject(e, classOf[Person]))
    // Sort by Person's own ordering (an Ordering[Person] must be available,
    // for example because Person implements Comparable). The sort shuffles
    // Person objects, which is where the Kryo serializer is used
    val s: RDD[Person] = per.sortBy(e => e)

    println(s.collect().toBuffer)

    // Release the resources held by the SparkContext
    sc.stop()
  }
}
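
The `Person` class from `com.zxx.spark.pojo` is not shown in the post. For `sortBy(e => e)` to compile it has to be comparable, and for fastjson to populate it, it needs bean-style properties. The following is a hypothetical stand-in, not the author's class: the field names follow the JSON keys above, and the sort rule (higher `fv` first, then lower `age`) is an assumption made purely for illustration.

```scala
import scala.beans.BeanProperty

// Hypothetical stand-in for com.zxx.spark.pojo.Person.
class Person extends Ordered[Person] with Serializable {
  @BeanProperty var name: String = ""
  @BeanProperty var age: Int = 0
  @BeanProperty var fv: Double = 0.0

  // Assumed ordering: higher fv first, ties broken by lower age.
  override def compare(that: Person): Int = {
    val byFv = java.lang.Double.compare(that.fv, this.fv)
    if (byFv != 0) byFv else Integer.compare(this.age, that.age)
  }

  override def toString: String = s"Person($name, $age, $fv)"
}

object PersonOrderingSketch {
  // Small helper to build a Person without a JSON parser.
  def make(name: String, age: Int, fv: Double): Person = {
    val p = new Person
    p.name = name
    p.age = age
    p.fv = fv
    p
  }

  def main(args: Array[String]): Unit = {
    val people = List(make("jd", 28, 98.0), make("zxx", 18, 100.0), make("dhh", 28, 99.0))
    // Sorting a plain list uses the same compare method that sortBy would rely on.
    println(people.sorted.map(_.name)) // prints List(zxx, dhh, jd)
  }
}
```

Because the ordering lives on the class itself, the same rule applies whether the objects are sorted in a local collection or in an RDD.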