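Serialization has three main uses:
(1) As a persistence format: once an object is serialized, its encoding can be stored on disk and deserialized later;
(2) As a communication data format: the serialized result can be sent over the network from one running virtual machine to another;
(3) As a copy or clone mechanism: serializing an object into an in-memory buffer and then deserializing it yields a new object that is a deep copy of the existing one.
Distributed data processing mainly relies on the first two.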
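The third use, deep copying through an in-memory buffer, can be sketched with Java's built-in serialization. This is a minimal illustration only; `DeepCopyDemo`, `Node`, and `deepCopy` are hypothetical names introduced for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

final class DeepCopyDemo {
    /** Hypothetical sample type; any Serializable object graph would do. */
    static final class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final List<String> tags = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }
    }

    /** Serializes obj into an in-memory buffer and reads it back as a new object. */
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T deepCopy(T obj) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("copy failed", e);
        }
    }

    public static void main(String[] args) {
        Node original = new Node("a");
        original.tags.add("x");
        Node copy = deepCopy(original);
        copy.tags.add("y"); // mutating the copy leaves the original untouched
        System.out.println(original.tags.size() + " vs " + copy.tags.size());
    }
}
```

Because the copy is rebuilt from bytes, it shares no mutable state with the original, including the nested list.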
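Java provides a built-in serialization mechanism, but for Hadoop, a platform that processes data at large scale, the serialization mechanism needs the following properties:
(1) Compact: a compact format makes full use of the data center's bandwidth;
(2) Fast: serialization and deserialization are used heavily, so their overhead must be small;
(3) Extensible: as the system evolves, its communication protocols are upgraded and class definitions change, and the serialization mechanism must support those upgrades and changes;
(4) Interoperable: it can support communication between programs written in different languages.
Java's serialization mechanism is powerful, but it does not meet these requirements.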
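The compactness requirement can be illustrated by comparing how many bytes Java's built-in serialization needs for a single boxed integer with a raw 4-byte write through `DataOutputStream`. This is only a rough sketch; `SizeDemo` and its methods are hypothetical names, and no exact size is claimed for the Java-serialized form.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

final class SizeDemo {
    /** Bytes produced by Java's built-in serialization for one boxed int. */
    static int javaSerializedSize(int value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                // The stream carries a header and a class descriptor besides the value.
                out.writeObject(Integer.valueOf(value));
            }
            return buffer.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Bytes produced by writing only the 4-byte value. */
    static int rawSize(int value) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                out.writeInt(value);
            }
            return buffer.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Java serialization: " + javaSerializedSize(42) + " bytes");
        System.out.println("Raw writeInt:       " + rawSize(42) + " bytes");
    }
}
```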
Hadoop introduces the org.apache.hadoop.io.Writable interface, which every serializable object must implement.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public interface Writable {
    /**
     * Serialize the fields of this object to <code>out</code>.
     *
     * @param out <code>DataOutput</code> to serialize this object into.
     * @throws IOException
     */
    void write(DataOutput out) throws IOException;

    /**
     * Deserialize the fields of this object from <code>in</code>.
     *
     * <p>For efficiency, implementations should attempt to re-use storage in the
     * existing object where possible.</p>
     *
     * @param in <code>DataInput</code> to deserialize this object from.
     * @throws IOException
     */
    void readFields(DataInput in) throws IOException;
}
Hadoop's serialization mechanism also includes several other important types: WritableComparable, RawComparator, and WritableComparator.
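To make the contract of write() and readFields() concrete, here is a minimal sketch of a two-field type that round-trips through a byte array. So that it runs without Hadoop on the classpath, it declares a local stand-in interface with the same two methods as the Writable interface above; `WritableDemo`, `PointWritable`, `toBytes`, and `fromBytes` are hypothetical names, not part of Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class WritableDemo {
    /** Local stand-in with the same two methods as org.apache.hadoop.io.Writable. */
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Hypothetical record type used only for illustration. */
    static final class PointWritable implements Writable {
        int x;
        int y;

        PointWritable() {}

        PointWritable(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(x); // fields are written in a fixed order ...
            out.writeInt(y);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            x = in.readInt(); // ... and read back in the same order
            y = in.readInt();
        }
    }

    /** Serializes any Writable into a byte array. */
    static byte[] toBytes(Writable w) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                w.write(out);
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Fills an existing Writable from bytes, reusing the object as the Javadoc suggests. */
    static void fromBytes(Writable w, byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            w.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(new PointWritable(3, -7));
        PointWritable restored = new PointWritable();
        fromBytes(restored, bytes);
        System.out.println(bytes.length + " bytes -> (" + restored.x + ", " + restored.y + ")");
    }
}
```

Note that no class name or field metadata is written: the reader must already know which type to populate, which is what keeps the encoding small.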
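Writable wrappers for Java primitive types
The table below lists the Writable wrappers for Java primitive types.
Java primitive type    Writable wrapper              Serialized size (bytes)
boolean                BooleanWritable               1
byte                   ByteWritable                  1
int                    IntWritable / VIntWritable    4 / 1-5
Taking VIntWritable as an example, the following code shows how a Writable wrapper for a Java primitive type is implemented: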
public class VIntWritable implements WritableComparable {
    private int value;

    public VIntWritable() {}

    public VIntWritable(int value) { set(value); }

    /** Set the value of this VIntWritable. */
    public void set(int value) { this.value = value; }

    /** Return the value of this VIntWritable. */
    public int get() { return value; }

    public void readFields(DataInput in) throws IOException {
        value = WritableUtils.readVInt(in);
    }

    public void write(DataOutput out) throws IOException {
        WritableUtils.writeVInt(out, value);
    }
    …
}
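VIntWritable reads and writes its value through the readVInt() and writeVInt() methods of the WritableUtils class, and those two methods simply delegate to readVLong() and writeVLong().
writeVLong() implements a variable-length encoding for integer values. Its encoding rules are as follows: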
* Serializes a long to a binary stream with zero-compressed encoding.
* For -112 <= i <= 127, only one byte is used with the actual value.
* For other values of i, the first byte value indicates whether the
* long is positive or negative, and the number of bytes that follow.
* If the first byte value v is between -113 and -120, the following long
* is positive, with number of bytes that follow are -(v+112).
* If the first byte value v is between -121 and -128, the following long
* is negative, with number of bytes that follow are -(v+120). Bytes are
* stored in the high-non-zero-byte-first order.
The code is as follows:
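public static void writeVLong(DataOutput stream, long i) throws IOException {
    // Values in [-112, 127] fit in a single byte that holds the value itself.
    if (i >= -112 && i <= 127) {
        stream.writeByte((byte)i);
        return;
    }

    int len = -112;
    if (i < 0) {
        i ^= -1L; // take one's complement
        len = -120;
    }

    // Count the bytes needed; each one moves len one step further down.
    long tmp = i;
    while (tmp != 0) {
        tmp = tmp >> 8;
        len--;
    }

    // The first byte encodes both the sign and the number of bytes that follow.
    stream.writeByte((byte)len);

    len = (len < -120) ? -(len + 120) : -(len + 112);

    // Write the payload, most significant non-zero byte first.
    for (int idx = len; idx != 0; idx--) {
        int shiftbits = (idx - 1) * 8;
        long mask = 0xFFL << shiftbits;
        stream.writeByte((byte)((i & mask) >> shiftbits));
    }
}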
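Working the rules by hand helps check one's understanding: 127 encodes as the single byte 0x7F, 128 as 0x8F 0x80 (first byte -113, one byte follows), 300 as 0x8E 0x01 0x2C, and -113 as 0x87 0x70 (first byte -121, followed by the one's complement 112). The sketch below repeats the encoding logic shown above against an in-memory buffer and adds a decoder written directly from the stated rules. `VLongDemo`, `encode`, and `decode` are hypothetical names; the decoder is an illustration of the rules, not a copy of Hadoop's readVLong().

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class VLongDemo {
    /** Encodes a long with the zero-compressed scheme described above. */
    static byte[] encode(long i) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream stream = new DataOutputStream(buffer);
            if (i >= -112 && i <= 127) {
                stream.writeByte((byte) i);
            } else {
                int len = -112;
                if (i < 0) {
                    i ^= -1L; // one's complement, so i is non-negative from here on
                    len = -120;
                }
                for (long tmp = i; tmp != 0; tmp >>= 8) {
                    len--;
                }
                stream.writeByte((byte) len);
                len = (len < -120) ? -(len + 120) : -(len + 112);
                for (int idx = len; idx != 0; idx--) {
                    int shiftbits = (idx - 1) * 8;
                    stream.writeByte((byte) ((i >> shiftbits) & 0xFF));
                }
            }
            stream.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Decodes a value produced by encode(), following the rules in the text. */
    static long decode(byte[] data) {
        int first = data[0]; // sign-extended first byte
        if (first >= -112) {
            return first; // single-byte form holds the value itself
        }
        boolean negative = first < -120;
        int count = negative ? -(first + 120) : -(first + 112);
        long value = 0;
        for (int k = 1; k <= count; k++) {
            value = (value << 8) | (data[k] & 0xFF);
        }
        return negative ? ~value : value; // undo the one's complement
    }

    public static void main(String[] args) {
        for (long v : new long[] {127, 128, 300, -113, Integer.MAX_VALUE}) {
            StringBuilder hex = new StringBuilder();
            for (byte b : encode(v)) {
                hex.append(String.format("%02x ", b));
            }
            System.out.println(v + " -> " + hex.toString().trim());
        }
    }
}
```

The largest int needs one length byte plus four payload bytes, which matches the 1-5 byte range given for VIntWritable.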
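Implementation of the ObjectWritable class
ObjectWritable can be used to serialize and deserialize parameters in Hadoop remote procedure calls. Another typical use is a field that has to hold objects of different types: for example, when the values stored in a SequenceFile are of different types (such as LongWritable or Text), the value type can be declared as ObjectWritable.
The implementation of ObjectWritable is fairly long, because each kind of object that may be wrapped in it needs its own handling. ObjectWritable has three member variables: the wrapped object instance, the Class object for that object's class, and a Configuration object.
ObjectWritable's write() method calls the static method ObjectWritable.writeObject(), which can write many kinds of Java objects to a DataOutput.
writeObject() first writes the object's class name and then serializes the object to the output stream according to its type. Why write the actual class name first? Under Java's single-inheritance model, the declaredClass passed to ObjectWritable may be the Class object of the instance's own class, or that of one of its superclasses. During serialization and deserialization, however, the superclass's methods (such as write) usually cannot be used, so the object's actual class name has to be recorded.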
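The idea of recording the actual class name before the payload can be shown with a deliberately small sketch. It is not Hadoop's writeObject(): it handles only three value types, and `TaggedValueDemo`, `writeTagged`, and `readTagged` are hypothetical names chosen for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class TaggedValueDemo {
    /** Writes the runtime class name first, then the value in a type-specific form. */
    static byte[] writeTagged(Object instance) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(instance.getClass().getName()); // actual class, not the declared one
            if (instance instanceof Integer) {
                out.writeInt((Integer) instance);
            } else if (instance instanceof Long) {
                out.writeLong((Long) instance);
            } else if (instance instanceof String) {
                out.writeUTF((String) instance);
            } else {
                throw new IllegalArgumentException("unsupported type: " + instance.getClass());
            }
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the class name back and uses it to pick the matching decoder. */
    static Object readTagged(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String className = in.readUTF();
            switch (className) {
                case "java.lang.Integer":
                    return in.readInt();
                case "java.lang.Long":
                    return in.readLong();
                case "java.lang.String":
                    return in.readUTF();
                default:
                    throw new IllegalArgumentException("unsupported type: " + className);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Number declaredAsNumber = Long.valueOf(7); // declared type is the superclass
        Object restored = readTagged(writeTagged(declaredAsNumber));
        System.out.println(restored.getClass().getName() + " " + restored);
    }
}
```

Even though the variable is declared as Number, the reader recovers a Long, because the stream carries the runtime class name rather than the declared type.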
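The WritableFactories class lets a non-public Writable subclass define an object factory; the static method WritableFactories.newInstance() then uses that factory to create the Writable object. The relevant code is as follows: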
public class WritableFactories {
    private static final HashMap<Class, WritableFactory> CLASS_TO_FACTORY =
        new HashMap<Class, WritableFactory>();
    …
    /** Create a new instance of a class with a defined factory. */
    public static Writable newInstance(Class<? extends Writable> c, Configuration conf) {
        WritableFactory factory = WritableFactories.getFactory(c);
        if (factory != null) {
            Writable result = factory.newInstance();
            if (result instanceof Configurable) {
                ((Configurable) result).setConf(conf);
            }
            return result;
        } else {
            return ReflectionUtils.newInstance(c, conf);
        }
    }
    …
}
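WritableFactories provides a registration mechanism through which these Writable subclasses register their factory in the static member variable CLASS_TO_FACTORY of WritableFactories. A typical WritableFactory implementation looks like this: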
public class Block implements Writable, Comparable<Block> {
    static { // register a ctor
        WritableFactories.setFactory
            (Block.class,
             new WritableFactory() {
                 public Writable newInstance() { return new Block(); }
             });
    }
    …
}
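The registry-plus-fallback pattern can be tried out in isolation with the following self-contained sketch. It uses local stand-in interfaces instead of the Hadoop types and omits the Configuration handling; `FactoryDemo`, `Token`, and `Plain` are hypothetical names. One detail is an assumption of this sketch rather than something taken from the Hadoop code above: a class literal such as `Token.class` does not run static initializers, so the lookup forces class initialization first.

```java
import java.util.HashMap;
import java.util.Map;

final class FactoryDemo {
    /** Local stand-ins for the Hadoop types, reduced to what the sketch needs. */
    interface Writable {}

    interface WritableFactory {
        Writable newInstance();
    }

    private static final Map<Class<?>, WritableFactory> CLASS_TO_FACTORY = new HashMap<>();

    static synchronized void setFactory(Class<?> c, WritableFactory factory) {
        CLASS_TO_FACTORY.put(c, factory);
    }

    static synchronized WritableFactory getFactory(Class<?> c) {
        return CLASS_TO_FACTORY.get(c);
    }

    /** Uses the registered factory if there is one, otherwise falls back to reflection. */
    static Writable newInstance(Class<? extends Writable> c) {
        try {
            // Force static initializers to run so that self-registration has happened.
            Class.forName(c.getName(), true, c.getClassLoader());
            WritableFactory factory = getFactory(c);
            if (factory != null) {
                return factory.newInstance();
            }
            return c.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + c.getName(), e);
        }
    }

    /** Has no accessible no-arg constructor, so it registers a factory for itself. */
    static final class Token implements Writable {
        final String origin;

        private Token(String origin) {
            this.origin = origin;
        }

        static {
            setFactory(Token.class, () -> new Token("factory"));
        }
    }

    /** Has a no-arg constructor, so the reflection fallback is enough. */
    static final class Plain implements Writable {
        Plain() {}
    }

    public static void main(String[] args) {
        System.out.println(((Token) newInstance(Token.class)).origin);
        System.out.println(newInstance(Plain.class).getClass().getSimpleName());
    }
}
```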
Hadoop serialization frameworks
Apache Avro avro.apache.org
Apache Thrift thrift.apache.org
Google Protocol Buffers