Serialization and Deserialization in the Hadoop IO Package
Java's built-in serialization mechanism is computationally expensive and produces bulky output. When Hadoop nodes communicate across the cluster or make RPC calls, serialization must be fast, compact, and light on bandwidth, so Java serialization is a poor fit. Hadoop therefore does not use Java's built-in types directly and instead ships its own serialization mechanism. The Writable interface, built on Java I/O's DataInput and DataOutput, writes data to and reads data from streams; it is a simple, effective, compact, and fast serialization protocol that handles the serialization and deserialization of keys and values throughout Hadoop.
The diagram below shows the relationships among the many classes in org.apache.hadoop.io:
Writable
The Writable interface defines a simple, efficient serialization protocol based on DataInput and DataOutput. In Hadoop's MapReduce framework, every key and value type implements this interface. The source is as follows:
Writable declares two methods: write(DataOutput out) and readFields(DataInput in). The former serializes the object's fields to a DataOutput; the latter deserializes the object's fields from a DataInput.
public interface Writable {
  /**
   * Serialize the fields of this object to <code>out</code>.
   *
   * @param out <code>DataOutput</code> to serialize this object into.
   * @throws IOException
   */
  void write(DataOutput out) throws IOException;

  /**
   * Deserialize the fields of this object from <code>in</code>.
   *
   * <p>For efficiency, implementations should attempt to re-use storage in the
   * existing object where possible.</p>
   *
   * @param in <code>DataInput</code> to deserialize this object from.
   * @throws IOException
   */
  void readFields(DataInput in) throws IOException;
}
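To make the write/readFields contract concrete, here is a minimal round-trip sketch using only java.io (the IntPairWritable class name and its fields are hypothetical; it mimics a Writable without depending on Hadoop on the classpath):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical Writable-style class: same write/readFields contract as the
// interface above, but self-contained so the sketch runs without Hadoop.
class IntPairWritable {
    int first;
    int second;

    void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    // Serialize to a byte array, then read back into a fresh instance.
    static IntPairWritable roundTrip(IntPairWritable src) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        src.write(new DataOutputStream(buf));
        IntPairWritable dst = new IntPairWritable();
        dst.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        return dst;
    }
}
```

Note that readFields fills in an existing object rather than allocating a new one, which is exactly the storage-reuse behavior the interface's Javadoc asks for.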
WritableComparable
WritableComparable extends Java's Comparable interface; implementors override compareTo(T o) so that instances can be ordered, and comparisons are usually performed through a Comparator. In Hadoop's MapReduce framework, every key type implements this interface. Anything that can serve as a key can also serve as a value, but not every value type can serve as a key.
Note that Hadoop frequently uses hashCode() to partition keys, so it is important that hashCode() return the same result across different JVM instances. The default Object.hashCode() does not satisfy this requirement. Consider the following example:
public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
  // Some data
  private int counter;
  private long timestamp;

  public void write(DataOutput out) throws IOException {
    out.writeInt(counter);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException {
    counter = in.readInt();
    timestamp = in.readLong();
  }

  public int compareTo(MyWritableComparable o) {
    int thisValue = this.counter;
    int thatValue = o.counter;
    return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
  }

  public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + counter;
    result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
    return result;
  }
}
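Because the partition is derived from hashCode(), all JVMs in the cluster must agree on its value. Hadoop's default HashPartitioner computes the partition as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; a minimal sketch of that formula (the PartitionDemo class is illustrative, not Hadoop API):

```java
// Sketch of the default HashPartitioner formula: masking with
// Integer.MAX_VALUE clears the sign bit so the result is never negative,
// then the modulo maps the hash onto [0, numReduceTasks).
class PartitionDemo {
    static int getPartition(int keyHash, int numReduceTasks) {
        return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

If hashCode() differed between JVMs, identical keys could land in different partitions and would no longer be grouped into the same reduce call.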
WritableComparator
WritableComparator is a comparator for WritableComparable types. To customize the sort order, override compare(WritableComparable, WritableComparable). Overriding compare(byte[], int, int, byte[], int, int) instead can optimize comparison-intensive operations: the former must first deserialize the keys before comparing them, while the latter compares the two objects directly in their binary form without deserialization, and is therefore faster and more efficient.
RawComparator
RawComparator is a comparator that operates directly on the byte-level representation of objects. It declares a single compare method over two binary objects:
public interface RawComparator<T> extends Comparator<T> {
  /**
   * Compare two objects in binary.
   * b1[s1:l1] is the first object, and b2[s2:l2] is the second object.
   *
   * @param b1 The first byte array.
   * @param s1 The position index in b1. The object under comparison's starting index.
   * @param l1 The length of the object in b1.
   * @param b2 The second byte array.
   * @param s2 The position index in b2. The object under comparison's starting index.
   * @param l2 The length of the object under comparison in b2.
   * @return An integer result of the comparison.
   */
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}
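As an illustration of byte-level comparison, the sketch below orders two serialized non-negative ints without deserializing them. DataOutput.writeInt writes big-endian, so for non-negative values an unsigned lexicographic byte comparison agrees with numeric order (the RawIntCompare class and its methods are illustrative, not Hadoop API; Hadoop itself offers WritableComparator.compareBytes for this):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative raw comparison over serialized keys, assuming non-negative
// ints written big-endian by DataOutput.writeInt. The compare method mirrors
// the shape of RawComparator.compare(byte[], int, int, byte[], int, int).
class RawIntCompare {
    static byte[] serialize(int v) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new DataOutputStream(buf).writeInt(v); // 4 bytes, big-endian
        return buf.toByteArray();
    }

    // Unsigned lexicographic comparison of b1[s1:s1+l1) and b2[s2:s2+l2).
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff; // treat bytes as unsigned
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return l1 - l2;
    }
}
```

This is why raw comparison wins in the shuffle's sort phase: the bytes sitting in the buffer are compared in place, with no object allocation or deserialization per comparison.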
Summary
In Hadoop, serialization serves two main purposes: interprocess communication and persistent storage. A process serializes an object into a byte stream and sends it over the network to another process; the receiving process deserializes the byte stream back into the structured object, and interprocess communication is achieved. The Mapper, Combiner, and Reducer stages all rely on serialization and deserialization. It is one of Hadoop's core technologies, and Writable is Hadoop's serialization implementation.