Serialization and Deserialization in the Hadoop IO Package
Java's built-in serialization mechanism is computationally expensive and produces bulky output. When Hadoop nodes communicate across the cluster or make RPC calls, serialization must be fast, compact, and light on bandwidth, so Java serialization is a poor fit. Hadoop therefore does not use Java's built-in types directly and instead ships its own serialization mechanism. The Writable interface, built on Java I/O's DataInput and DataOutput, writes data to and reads data from streams; it is a simple, effective, compact, and fast serialization protocol that handles the serialization and deserialization of keys and values throughout Hadoop.
The diagram below shows the relationships among the many classes in org.apache.hadoop.io:
Writable
The Writable interface defines a simple, efficient serialization protocol based on DataInput and DataOutput. In Hadoop's MapReduce framework, every key and value type implements this interface. The source is as follows:
Writable declares two methods: write(DataOutput out) and readFields(DataInput in). The former serializes the object's fields to a DataOutput; the latter deserializes the object's fields from a DataInput.
public interface Writable {
  /**
   * Serialize the fields of this object to <code>out</code>.
   *
   * @param out <code>DataOutput</code> to serialize this object into.
   * @throws IOException
   */
  void write(DataOutput out) throws IOException;

  /**
   * Deserialize the fields of this object from <code>in</code>.
   *
   * <p>For efficiency, implementations should attempt to re-use storage in the
   * existing object where possible.</p>
   *
   * @param in <code>DataInput</code> to deserialize this object from.
   * @throws IOException
   */
  void readFields(DataInput in) throws IOException;
}
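To make the write/readFields contract concrete, here is a minimal round-trip sketch using only java.io (the IntPairWritable class name and its fields are hypothetical; it mimics a Writable without depending on Hadoop on the classpath):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical Writable-style class: same write/readFields contract as the
// interface above, but self-contained so the sketch runs without Hadoop.
class IntPairWritable {
    int first;
    int second;

    void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    // Serialize to a byte array, then read back into a fresh instance.
    static IntPairWritable roundTrip(IntPairWritable src) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        src.write(new DataOutputStream(buf));
        IntPairWritable dst = new IntPairWritable();
        dst.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        return dst;
    }
}
```

Note that readFields fills in an existing object rather than allocating a new one, which is exactly the storage-reuse behavior the interface's Javadoc asks for.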
WritableComparable
WritableComparable extends Java's Comparable interface; implementors override compareTo(T o) so that instances can be ordered, and comparisons are usually performed through a Comparator. In Hadoop's MapReduce framework, every key type implements this interface. Anything that can serve as a key can also serve as a value, but not every value type can serve as a key.
Note that Hadoop frequently uses hashCode() to partition keys, so it is important that hashCode() return the same result across different JVM instances. The default Object.hashCode() does not satisfy this requirement. Consider the following example:
public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
  // Some data
  private int counter;
  private long timestamp;

  public void write(DataOutput out) throws IOException {
    out.writeInt(counter);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException {
    counter = in.readInt();
    timestamp = in.readLong();
  }

  public int compareTo(MyWritableComparable o) {
    int thisValue = this.counter;
    int thatValue = o.counter;
    return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
  }

  public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + counter;
    result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
    return result;
  }
}
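Because the partition is derived from hashCode(), all JVMs in the cluster must agree on its value. Hadoop's default HashPartitioner computes the partition as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; a minimal sketch of that formula (the PartitionDemo class is illustrative, not Hadoop API):

```java
// Sketch of the default HashPartitioner formula: masking with
// Integer.MAX_VALUE clears the sign bit so the result is never negative,
// then the modulo maps the hash onto [0, numReduceTasks).
class PartitionDemo {
    static int getPartition(int keyHash, int numReduceTasks) {
        return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

If hashCode() differed between JVMs, identical keys could land in different partitions and would no longer be grouped into the same reduce call.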
WritableComparator
WritableComparator is a comparator for WritableComparable types. To customize the sort order, override compare(WritableComparable, WritableComparable). Overriding compare(byte[], int, int, byte[], int, int) instead can optimize comparison-intensive operations: the former must first deserialize the keys before comparing them, while the latter compares the two objects directly in their binary form without deserialization, and is therefore faster and more efficient.
RawComparator
RawComparator is a comparator that operates directly on the byte-level representation of objects. It declares a single compare method over two binary objects:
public interface RawComparator<T> extends Comparator<T> {
  /**
   * Compare two objects in binary.
   * b1[s1:l1] is the first object, and b2[s2:l2] is the second object.
   *
   * @param b1 The first byte array.
   * @param s1 The position index in b1. The object under comparison's starting index.
   * @param l1 The length of the object in b1.
   * @param b2 The second byte array.
   * @param s2 The position index in b2. The object under comparison's starting index.
   * @param l2 The length of the object under comparison in b2.
   * @return An integer result of the comparison.
   */
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}
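As an illustration of byte-level comparison, the sketch below orders two serialized non-negative ints without deserializing them. DataOutput.writeInt writes big-endian, so for non-negative values an unsigned lexicographic byte comparison agrees with numeric order (the RawIntCompare class and its methods are illustrative, not Hadoop API; Hadoop itself offers WritableComparator.compareBytes for this):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative raw comparison over serialized keys, assuming non-negative
// ints written big-endian by DataOutput.writeInt. The compare method mirrors
// the shape of RawComparator.compare(byte[], int, int, byte[], int, int).
class RawIntCompare {
    static byte[] serialize(int v) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new DataOutputStream(buf).writeInt(v); // 4 bytes, big-endian
        return buf.toByteArray();
    }

    // Unsigned lexicographic comparison of b1[s1:s1+l1) and b2[s2:s2+l2).
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff; // treat bytes as unsigned
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return l1 - l2;
    }
}
```

This is why raw comparison wins in the shuffle's sort phase: the bytes sitting in the buffer are compared in place, with no object allocation or deserialization per comparison.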
Summary
In Hadoop, serialization serves two main purposes: interprocess communication and persistent storage. A process serializes an object into a byte stream and sends it over the network to another process; the receiving process deserializes the byte stream back into the structured object, and interprocess communication is achieved. The Mapper, Combiner, and Reducer stages all rely on serialization and deserialization. It is one of Hadoop's core technologies, and Writable is Hadoop's serialization implementation.