一: Configuration
You can set the serializer globally in spark-defaults.conf, or set it on the SparkConf when initializing in code: conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). This setting applies both to shuffling data between machines and to serializing RDDs to memory or disk.
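For the global route, the equivalent entry in conf/spark-defaults.conf is a single whitespace-separated key-value line (a minimal sketch):

# conf/spark-defaults.conf -- applies to every application submitted from this client
spark.serializer    org.apache.spark.serializer.KryoSerializer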
Spark does not make Kryo the default serializer because it requires you to register your classes, but the official docs strongly recommend Kryo for any network-intensive application.
As the official tuning guide (https://spark.apache.org/docs/latest/tuning.html) puts it: "You can switch to using Kryo by initializing your job with a SparkConf and calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). This setting configures the serializer used for not only shuffling data between worker nodes but also when serializing RDDs to disk."
val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Student]))
val sc = new SparkContext(sparkConf)
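If you want Spark to fail fast whenever an unregistered class goes through Kryo, instead of silently falling back to writing full class names, you can additionally turn on strict registration. A minimal sketch building on the same Student class:

val strictConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true") // throw an error if any class is serialized without being registered
  .registerKryoClasses(Array(classOf[Student]))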
二: Hands-on example
package g5.learning

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

import scala.collection.mutable.ListBuffer

// simple case class used to demonstrate Kryo class registration
case class Student(id: Int, name: String, age: Int)

object SerializationApp1 {

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      // .setMaster("local[2]").setAppName("SerializationApp1")
      // .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // pass these via the shell script instead, so the source does not have to change every time
      .registerKryoClasses(Array(classOf[Student])) // optional: Kryo also works unregistered, but then it stores the full class name with every object, which wastes space

    val sc = new SparkContext(sparkConf)

    val students = ListBuffer[Student]()
    for (i <- 1 to 10000) {
      students.append(Student(i, "ruoze" + i, 39))
    }

    val studentRDD = sc.parallelize(students)
    studentRDD.persist(StorageLevel.MEMORY_ONLY_SER) // stored as serialized bytes, so the serializer choice matters
    studentRDD.count() // action to trigger the persist

    Thread.sleep(1000 * 20) // keep the app alive long enough to inspect the web UI
    sc.stop()
  }
}
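To put a rough number on the "wastes space" comment, Spark ships a SizeEstimator utility that estimates the on-heap footprint of an object; comparing that raw estimate with the "Size in Memory" shown in the web UI's Storage tab makes the serialization savings visible. A minimal sketch, assuming these lines are added inside main before sc.stop():

import org.apache.spark.util.SizeEstimator
// estimated unserialized, on-heap size of the whole Student buffer
println("estimated raw size: " + SizeEstimator.estimate(students) + " bytes")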
三: Build the jar, upload it, and edit the shell script
[hadoop@hadoop001 shell]$ vi kryo_ser.sh
[hadoop@hadoop001 shell]$ rz
[hadoop@hadoop001 shell]$ vi kryo_ser.sh
export HADOOP_CONF_DIR=/home/hadoop/app/hadoop-2.6.0-cdh5.7.0/etc/hadoop
$SPARK_HOME/bin/spark-submit \
--master local[2] \
--class g5.learning.SerializationApp1 \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--name SerializationApp1 \
/home/hadoop/lib/g5spark1-1.0.jar
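Run it with sh kryo_ser.sh. During the 20-second sleep, open the driver's web UI (port 4040 by default, e.g. http://hadoop001:4040) and look at the Storage tab: with MEMORY_ONLY_SER the RDD is held as serialized bytes, and its "Size in Memory" should be noticeably smaller under Kryo than under the default Java serializer.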