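I have recently been working with serialization in Spark. Because sc's parallelize method, transformations, and actions all run in a distributed fashion, values are copied between the driver and the tasks on the workers, so I ran a quick test of serialization performance. For a Java object to travel across the network, it must first be serialized into a byte array.
Spark's default serializer is Java's built-in one, based on ObjectInputStream and ObjectOutputStream (chosen mainly for convenience and generality). By default, a custom type used for the elements of an RDD must implement the Serializable interface. You can also implement the Externalizable interface to define your own, potentially more efficient, Java serialization. With the default ObjectInputStream and ObjectOutputStream, the serialized data takes up a lot of memory or disk space and a lot of network bandwidth, and serializing and deserializing is fairly CPU-intensive.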
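To make the Serializable requirement concrete, here is a minimal sketch using only the JDK. The names RoundTripDemo, Point, and PlainPoint are hypothetical and not part of the original test. It turns an object into a byte array and back, and shows that a class which does not implement Serializable is rejected with a NotSerializableException.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical demo class; the names here are illustrative only.
final class RoundTripDemo {

    // A value type that opts in to Java serialization.
    static final class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int x;
        final int y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // A value type that does NOT implement Serializable.
    static final class PlainPoint {
        final int x;
        PlainPoint(int x) { this.x = x; }
    }

    // Serialize any object to a byte array with ObjectOutputStream.
    static byte[] toBytes(Object value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Rebuild an object from bytes produced by toBytes.
    static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns false when the JDK refuses the object with NotSerializableException.
    static boolean isSerializable(Object value) {
        try {
            toBytes(value);
            return true;
        } catch (UncheckedIOException e) {
            if (e.getCause() instanceof NotSerializableException) {
                return false;
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        Point copy = (Point) fromBytes(toBytes(new Point(3, 4)));
        System.out.println(copy.x + "," + copy.y);
        System.out.println(isSerializable(new PlainPoint(1)));
    }
}
```

The byte array returned by toBytes is the form in which an object would have to be shipped between processes; everything else in the sketch is scaffolding for the demonstration.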
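This test covers several commonly used serialization approaches:
1 Java default serialization: implement the Serializable interface
2 Java custom serialization: implement the Externalizable interface
3 Java serialization framework: Kryo
4 Java serialization framework: FST
There are others, such as fastjson and jackson, which are not covered here.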
1 First, the default serialization test, writing 500,000 User objects
import java.io.Serializable;
import java.util.Date;

// Test bean for the JDK's default serialization.
public class DefaultUser implements Serializable {
    private String username;
    private String password;
    private int age;
    private Date birth;

    public DefaultUser(String username, String password, int age, Date birth) {
        this.username = username;
        this.password = password;
        this.age = age;
        this.birth = birth;
    }

    public DefaultUser() {
    }
}
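import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

public class JDfaultSerialize {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("E:\\data1.dat"));
        // Build the test data. Note that i <= 500000 creates 500,001 objects.
        ArrayList<Object> stu = new ArrayList<>();
        for (int i = 0; i <= 500000; i++) {
            stu.add(new DefaultUser("gg", "123", 11, new Date()));
        }
        long start = new Date().getTime();
        // Write the whole list with the JDK's default serialization.
        out.writeObject(stu);
        out.flush();
        out.close();
        // ...
        // Open the input stream only after the file has been completely written.
        ObjectInputStream in = new ObjectInputStream(new FileInputStream("E:\\data1.dat"));
        @SuppressWarnings("unchecked")
        List<Object> someObject = (List<Object>) in.readObject();
        System.out.println(someObject);
        in.close();
        // Elapsed milliseconds: covers writing, reading back, and printing the list.
        System.out.println(new Date().getTime() - start);
    }
}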
Time: 24503 ms; file size: 18,500,243 bytes (about 18.5 MB)
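The timed region in the test above covers writing the file, reading it back, and printing the whole list, so the figure is a combined number. As a rough sketch of how the phases could be measured separately, the following version works in memory and uses System.nanoTime. The names SerializationTimer, SampleUser, Result, and measure are hypothetical, and the numbers it prints will not match the figures reported in this post.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Hypothetical timing harness; not the code used for the reported results.
final class SerializationTimer {

    // Same field layout as the DefaultUser bean in the post.
    static final class SampleUser implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String username;
        private final String password;
        private final int age;
        private final Date birth;

        SampleUser(String username, String password, int age, Date birth) {
            this.username = username;
            this.password = password;
            this.age = age;
            this.birth = birth;
        }
    }

    // Holds the measurements of one run.
    static final class Result {
        final long serializeNanos;
        final long deserializeNanos;
        final int byteCount;
        final int restoredCount;

        Result(long serializeNanos, long deserializeNanos, int byteCount, int restoredCount) {
            this.serializeNanos = serializeNanos;
            this.deserializeNanos = deserializeNanos;
            this.byteCount = byteCount;
            this.restoredCount = restoredCount;
        }
    }

    // Serializes and deserializes `count` users in memory, timing each phase on its own.
    static Result measure(int count) {
        List<SampleUser> users = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            users.add(new SampleUser("gg", "123", 11, new Date()));
        }
        try {
            long t0 = System.nanoTime();
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(users);
            }
            byte[] bytes = buffer.toByteArray();
            long t1 = System.nanoTime();

            List<?> restored;
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                restored = (List<?>) in.readObject();
            }
            long t2 = System.nanoTime();
            return new Result(t1 - t0, t2 - t1, bytes.length, restored.size());
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Result r = measure(500_000);
        System.out.println("serialize ms:   " + r.serializeNanos / 1_000_000);
        System.out.println("deserialize ms: " + r.deserializeNanos / 1_000_000);
        System.out.println("bytes:          " + r.byteCount);
    }
}
```

A single run like this still includes JIT warm-up, so it is only a starting point; repeating the measurement several times would give steadier numbers.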
2 Next, custom serialization by implementing the Externalizable interface
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.util.Date;

// Test bean that controls its own wire format through Externalizable.
public class ExterUser implements Externalizable {
    private String username;
    private String password;
    private int age;
    private Date birth;

    public ExterUser(String username, String password, int age, Date birth) {
        this.username = username;
        this.password = password;
        this.age = age;
        this.birth = birth;
    }

    // Externalizable requires a public no-argument constructor.
    public ExterUser() {
    }

    @Override
    public void writeExternal(ObjectOutput stream) throws IOException {
        stream.writeObject(this.username);
        stream.writeObject(this.password);
        stream.writeInt(this.age);
        stream.writeObject(this.birth);
    }

    // Fields must be read in exactly the order they were written.
    @Override
    public void readExternal(ObjectInput stream) throws IOException, ClassNotFoundException {
        this.username = (String) stream.readObject();
        this.password = (String) stream.readObject();
        this.age = stream.readInt();
        this.birth = (Date) stream.readObject();
    }
}
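import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

public class ExSerialize {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("E:\\data1.dat"));
        // Build the test data. Note that i <= 500000 creates 500,001 objects.
        ArrayList<Object> stu = new ArrayList<>();
        for (int i = 0; i <= 500000; i++) {
            stu.add(new ExterUser("gg", "123", 11, new Date()));
        }
        long start = new Date().getTime();
        // Each element is written through ExterUser.writeExternal.
        out.writeObject(stu);
        out.flush();
        out.close();
        // ...
        // Open the input stream only after the file has been completely written.
        ObjectInputStream in = new ObjectInputStream(new FileInputStream("E:\\data1.dat"));
        @SuppressWarnings("unchecked")
        List<Object> someObject = (List<Object>) in.readObject();
        System.out.println(someObject);
        in.close();
        // Elapsed milliseconds: covers writing, reading back, and printing the list.
        System.out.println(new Date().getTime() - start);
    }
}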
Time: 31459 ms; file size: 20,000,163 bytes (about 20 MB)
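One likely reason the Externalizable version came out larger than the default is that writeExternal above still calls writeObject for every field, so each Date is written as a full object record. The sketch below is an assumption about how the format could be tightened, not what the test above did: it writes the strings with writeUTF and the date as a long. The names CompactExternalDemo, DefaultSample, and CompactUser are hypothetical, and all fields are assumed to be non-null.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Hypothetical demo; none of these class names come from the original test.
final class CompactExternalDemo {

    // Baseline: same fields as DefaultUser, default Java serialization.
    static final class DefaultSample implements Serializable {
        private static final long serialVersionUID = 1L;
        String username; String password; int age; Date birth;
        DefaultSample(String username, String password, int age, Date birth) {
            this.username = username; this.password = password; this.age = age; this.birth = birth;
        }
    }

    // Externalizable bean that writes primitives instead of nested objects.
    static final class CompactUser implements Externalizable {
        String username; String password; int age; Date birth;

        public CompactUser() { }  // required public no-argument constructor

        CompactUser(String username, String password, int age, Date birth) {
            this.username = username; this.password = password; this.age = age; this.birth = birth;
        }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeUTF(username);          // assumes non-null
            out.writeUTF(password);          // assumes non-null
            out.writeInt(age);
            out.writeLong(birth.getTime());  // 8 bytes instead of a Date object record
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            username = in.readUTF();
            password = in.readUTF();
            age = in.readInt();
            birth = new Date(in.readLong());
        }
    }

    static byte[] toBytes(Object value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return buffer.toByteArray();
    }

    static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Serialized size of a list of `count` default-serialized beans.
    static int defaultSize(int count) {
        List<Object> users = new ArrayList<>();
        for (int i = 0; i < count; i++) users.add(new DefaultSample("gg", "123", 11, new Date(i)));
        return toBytes(users).length;
    }

    // Serialized size of a list of `count` primitive-writing Externalizable beans.
    static int compactSize(int count) {
        List<Object> users = new ArrayList<>();
        for (int i = 0; i < count; i++) users.add(new CompactUser("gg", "123", 11, new Date(i)));
        return toBytes(users).length;
    }

    public static void main(String[] args) {
        System.out.println("default bytes: " + defaultSize(1000));
        System.out.println("compact bytes: " + compactSize(1000));
    }
}
```

The trade-off is that writeUTF and getTime throw on null fields, so a real bean would need a null marker; whether the saving holds for longer or more varied strings would have to be measured.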
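3 Kryo test (by default Kryo requires the class to have a no-argument constructor; the class does not even need to implement the Serializable interface)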
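import java.util.Date;

// Test bean for Kryo: a plain class, no Serializable needed.
public class KyroUser {
    private String username;
    private String password;
    private int age;
    private Date birth;

    public KyroUser(String username, String password, int age, Date birth) {
        this.username = username;
        this.password = password;
        this.age = age;
        this.birth = birth;
    }

    // Kryo's default instantiation uses the no-argument constructor.
    public KyroUser() {
    }
}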
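import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

public class KyroSerialize {
    public static void main(String[] args) throws FileNotFoundException {
        Kryo kryo = new Kryo();
        // Kryo 5 requires class registration by default; this keeps the older behavior.
        kryo.setRegistrationRequired(false);
        Output output = new Output(new FileOutputStream("F://a.txt"));
        Input input = new Input(new FileInputStream("f://a.txt"));
        ArrayList<Object> objects = new ArrayList<>();
        for (int i = 0; i <= 500000; i++) {
            objects.add(new KyroUser("gg", "123", 11, new Date()));
        }
        long start = new Date().getTime();
        kryo.writeObject(output, objects);
        output.flush();
        output.close();
        // ...
        List<Object> someObject = kryo.readObject(input, ArrayList.class);
        System.out.println(someObject);
        input.close();
        // Elapsed milliseconds: covers writing, reading back, and printing the list.
        System.out.println(new Date().getTime() - start);
    }
}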
Time: 2864 ms; file size: 7,500,065 bytes (about 7.5 MB)
4 Next, the FST serialization framework
import java.io.Serializable;
import java.util.Date;

// Test bean for FST; it implements Serializable like the default-serialization bean.
public class FTSUser implements Serializable {
    private String username;
    private String password;
    private int age;
    private Date birth;

    public FTSUser(String username, String password, int age, Date birth) {
        this.username = username;
        this.password = password;
        this.age = age;
        this.birth = birth;
    }

    public FTSUser() {
    }
}
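import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Date;
import org.nustaq.serialization.FSTConfiguration;

public class FTSSerialize {
    public static void main(String[] args) throws IOException {
        FileOutputStream fos = new FileOutputStream(new File("F://d.txt"));
        FSTConfiguration configuration = FSTConfiguration.createDefaultConfiguration();
        ArrayList<Object> stu = new ArrayList<>();
        for (int i = 0; i <= 500000; i++) {
            stu.add(new FTSUser("gg", "123", 11, new Date()));
        }
        long start = new Date().getTime();
        // Serialize to a byte array in memory, then deserialize from it.
        byte[] bytes = configuration.asByteArray(stu);
        Object someObject = configuration.asObject(bytes);
        System.out.println(someObject);
        // Elapsed milliseconds: in-memory round trip plus printing; the file write below is not timed.
        System.out.println(new Date().getTime() - start);
        fos.write(bytes);
        fos.close();
    }
}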
Time: 2305 ms; byte array size: 9,500,058 bytes (about 9.5 MB)
Summary of results
Java default Serializable: time 24503 ms, file size 18,500,243 bytes
Java custom Externalizable: time 31459 ms, file size 20,000,163 bytes
Kryo: time 2864 ms, file size 7,500,065 bytes
FST: time 2305 ms, byte array size 9,500,058 bytes
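Java's default serialization performs very poorly, and it also has some security risks.
Kryo is very efficient. In this test it was close to 9 times as fast as Java's native serialization, and its output is compact too: about 0.4 times the size of the default serialization output.
FST also performed very well in this test, but because FST is not yet widely used in production, its stability is unclear. Kryo is the recommended choice.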
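The ratios behind these conclusions can be checked directly from the reported figures. The short sketch below only redoes that arithmetic; the class name ResultRatios and its constants are hypothetical, and the values are copied from the results listed above.

```java
// Hypothetical helper that recomputes ratios from the figures reported in this post.
final class ResultRatios {

    // Reported times in milliseconds.
    static final long JAVA_MS = 24503;
    static final long KRYO_MS = 2864;
    static final long FST_MS = 2305;

    // Reported output sizes.
    static final long JAVA_SIZE = 18_500_243;
    static final long KRYO_SIZE = 7_500_065;
    static final long FST_SIZE = 9_500_058;

    // Plain floating-point ratio of two measurements.
    static double ratio(long numerator, long denominator) {
        return (double) numerator / denominator;
    }

    public static void main(String[] args) {
        System.out.printf("Kryo speed-up over Java default: %.2f%n", ratio(JAVA_MS, KRYO_MS));
        System.out.printf("FST speed-up over Java default:  %.2f%n", ratio(JAVA_MS, FST_MS));
        System.out.printf("Kryo size relative to default:   %.2f%n", ratio(KRYO_SIZE, JAVA_SIZE));
        System.out.printf("FST size relative to default:    %.2f%n", ratio(FST_SIZE, JAVA_SIZE));
    }
}
```

These are single-run numbers that include printing the list, so the ratios are indicative rather than precise.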