Hadoop Implementation Principles (2): Serialization

Latest recommended article published on 2022-01-01 22:13:51

imck

Category: hadoop  Tags: hadoop

Article link: https://blog.csdn.net/u012308776/article/details/44907275


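Serialization has three main uses:
(1) As a persistence format: once an object has been serialized, its encoded form can be stored on disk and deserialized later;
(2) As a communication data format: the serialized result can be sent over the network from one running virtual machine to another;
(3) As a copy/clone mechanism: serializing an object into an in-memory buffer and then deserializing it yields a new object that is a deep copy of the existing one.
Distributed data processing mainly relies on the first two.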
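
Use (3) can be illustrated with Java's built-in serialization. The following is a minimal sketch, not code from Hadoop: the class name DeepCopyDemo and the sample type Job are hypothetical and exist only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

class DeepCopyDemo {
  /** Hypothetical sample type; any Serializable object graph would do. */
  static class Job implements Serializable {
    private static final long serialVersionUID = 1L;
    String name;
    List<String> inputs = new ArrayList<>();
    Job(String name) { this.name = name; }
  }

  /** Serializes into an in-memory buffer, then deserializes to get an independent copy. */
  @SuppressWarnings("unchecked")
  static <T extends Serializable> T deepCopy(T original)
      throws IOException, ClassNotFoundException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
      out.writeObject(original);
    }
    try (ObjectInputStream in =
             new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
      return (T) in.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    Job a = new Job("wordcount");
    a.inputs.add("/data/part-0");
    Job b = deepCopy(a);
    b.inputs.add("/data/part-1"); // changes the copy only
    System.out.println(a.inputs.size() + " " + b.inputs.size()); // prints "1 2"
  }
}
```

Because the nested list is copied as well, changing the copy leaves the original untouched, which is what distinguishes a deep copy from a shallow one.
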
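Java provides a built-in serialization mechanism, but for Hadoop, a platform that processes data at large scale, the serialization mechanism needs the following properties:
(1) Compact: a compact serialization mechanism makes full use of the data center's bandwidth;
(2) Fast: serialization is used heavily in inter-process communication, so the overhead of serializing and deserializing must be kept low;
(3) Extensible: as the system evolves, its communication protocols are upgraded and class definitions change, and the serialization mechanism must support these upgrades and changes;
(4) Interoperable: it can support communication between programs written in different languages.
Java's serialization mechanism is powerful, but it does not meet these requirements.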
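
The compactness point can be checked with a small experiment. The sketch below uses only the JDK and compares the number of bytes Java's built-in serialization emits for one boxed int with the number of bytes needed when only the field itself is written. The class name SerializedSizeDemo is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

class SerializedSizeDemo {
  /** Bytes produced by Java's built-in object serialization for one boxed int. */
  static int javaSerializedSize(int value) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
      out.writeObject(Integer.valueOf(value));
    }
    return buffer.size();
  }

  /** Bytes produced when only the 4-byte field is written to a DataOutput. */
  static int rawSize(int value) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      out.writeInt(value);
    }
    return buffer.size();
  }

  public static void main(String[] args) throws IOException {
    System.out.println("Java serialization: " + javaSerializedSize(42) + " bytes");
    System.out.println("Raw field only:     " + rawSize(42) + " bytes");
  }
}
```

The built-in format also carries a stream header and a class description, so its output is considerably larger than the four bytes of payload.
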
Hadoop introduces the org.apache.hadoop.io.Writable interface, which every serializable object must implement:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public interface Writable {
  /** 
   * Serialize the fields of this object to <code>out</code>.
   * 
   * @param out <code>DataOutput</code> to serialize this object into.
   * @throws IOException
   */
  void write(DataOutput out) throws IOException;

  /** 
   * Deserialize the fields of this object from <code>in</code>.  
   * 
   * <p>For efficiency, implementations should attempt to re-use storage in the 
   * existing object where possible.</p>
   * 
   * @param in <code>DataInput</code> to deserialize this object from.
   * @throws IOException
   */
  void readFields(DataInput in) throws IOException;
}
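
To show how the two methods work together, here is a minimal sketch of a type that implements them and is round-tripped through a byte array. So that it runs without Hadoop on the classpath, it declares a local stand-in interface with the same two methods as the Writable shown above. The names WritableRoundTrip and PageView and the helpers serialize() and deserialize() are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

class WritableRoundTrip {
  /** Local stand-in mirroring the two methods of org.apache.hadoop.io.Writable. */
  interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
  }

  /** Hypothetical record type with two fields. */
  static class PageView implements Writable {
    long timestamp;
    String url = "";

    public void write(DataOutput out) throws IOException {
      out.writeLong(timestamp); // 8 bytes
      out.writeUTF(url);        // 2-byte length prefix plus the encoded characters
    }

    public void readFields(DataInput in) throws IOException {
      // Fields must be read in exactly the order they were written.
      timestamp = in.readLong();
      url = in.readUTF();
    }
  }

  static byte[] serialize(Writable w) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      w.write(out);
    }
    return buffer.toByteArray();
  }

  static void deserialize(Writable w, byte[] bytes) throws IOException {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
      w.readFields(in); // overwrites the fields of the existing object
    }
  }

  public static void main(String[] args) throws IOException {
    PageView original = new PageView();
    original.timestamp = 1428000000L;
    original.url = "/index.html";

    byte[] bytes = serialize(original);
    PageView copy = new PageView();
    deserialize(copy, bytes);
    System.out.println(bytes.length + " bytes: " + copy.timestamp + " " + copy.url);
  }
}
```

Note that readFields() fills an existing object rather than returning a new one, which is what allows the storage re-use mentioned in the Javadoc.
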

Hadoop's serialization mechanism also includes several other important types: WritableComparable, RawComparator, and WritableComparator.
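
RawComparator allows two records to be compared in their serialized form, without first deserializing them into objects. The sketch below illustrates the idea for 4-byte big-endian integers. It declares a local stand-in interface whose compare() method has the same parameter list as Hadoop's RawComparator; the names RawCompareDemo, INT_COMPARATOR and encode() are hypothetical.

```java
import java.nio.ByteBuffer;

class RawCompareDemo {
  /** Local stand-in with the same parameter list as RawComparator.compare(...). */
  interface RawComparator {
    int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
  }

  /** Compares two big-endian 4-byte ints directly from their serialized bytes. */
  static final RawComparator INT_COMPARATOR = (b1, s1, l1, b2, s2, l2) -> {
    int left = ByteBuffer.wrap(b1, s1, l1).getInt();
    int right = ByteBuffer.wrap(b2, s2, l2).getInt();
    return Integer.compare(left, right);
  };

  /** Encodes an int the way DataOutput.writeInt() does: 4 bytes, big-endian. */
  static byte[] encode(int value) {
    return ByteBuffer.allocate(4).putInt(value).array();
  }

  public static void main(String[] args) {
    byte[] a = encode(3);
    byte[] b = encode(10);
    System.out.println(INT_COMPARATOR.compare(a, 0, a.length, b, 0, b.length) < 0); // prints "true"
  }
}
```
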

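Writable wrappers for Java primitive types
The Writable wrappers currently available for Java primitive types are shown in the table below.
Java primitive type    Writable                    Serialized length (bytes)
boolean                BooleanWritable             1
byte                   ByteWritable                1
int                    IntWritable/VIntWritable    4/1-5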

The following uses VIntWritable as an example of how a Writable wrapper for a Java primitive type is implemented:

public class VIntWritable implements WritableComparable {
  private int value;

  public VIntWritable() {}

  public VIntWritable(int value) { set(value); }

  /** Set the value of this VIntWritable. */
  public void set(int value) { this.value = value; }

  /** Return the value of this VIntWritable. */
  public int get() { return value; }

  public void readFields(DataInput in) throws IOException {
    value = WritableUtils.readVInt(in);
  }

  public void write(DataOutput out) throws IOException {
    WritableUtils.writeVInt(out, value);
  }
…
}

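VIntWritable reads and writes its data through the readVInt() and writeVInt() methods of the WritableUtils class, and readVInt() and writeVInt() in turn simply call readVLong() and writeVLong().
writeVLong() implements a variable-length encoding for integer values. Its encoding rules, quoted from the method's Javadoc, are as follows: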
* Serializes a long to a binary stream with zero-compressed encoding.
* For -112 <= i <= 127, only one byte is used with the actual value.
* For other values of i, the first byte value indicates whether the
* long is positive or negative, and the number of bytes that follow.
* If the first byte value v is between -113 and -120, the following long
* is positive, with number of bytes that follow are -(v+112).
* If the first byte value v is between -121 and -128, the following long
* is negative, with number of bytes that follow are -(v+120). Bytes are
* stored in the high-non-zero-byte-first order.
The code is as follows:

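public static void writeVLong(DataOutput stream, long i) throws IOException {
  // Small values fit in a single byte that holds the value itself.
  if (i >= -112 && i <= 127) {
    stream.writeByte((byte) i);
    return;
  }

  // len starts at the marker base: -112 for positive, -120 for negative values.
  int len = -112;
  if (i < 0) {
    i ^= -1L; // take one's complement
    len = -120;
  }

  // Decrement the marker once for every non-zero byte of the value.
  long tmp = i;
  while (tmp != 0) {
    tmp = tmp >> 8;
    len--;
  }

  // First byte: encodes both the sign and the number of bytes that follow.
  stream.writeByte((byte) len);

  // Recover the byte count from the marker.
  len = (len < -120) ? -(len + 120) : -(len + 112);

  // Write the value bytes, most significant non-zero byte first.
  for (int idx = len; idx != 0; idx--) {
    int shiftbits = (idx - 1) * 8;
    long mask = 0xFFL << shiftbits;
    stream.writeByte((byte) ((i & mask) >> shiftbits));
  }
}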
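
To make the encoding rules concrete, the following self-contained sketch pairs a condensed copy of the writeVLong() algorithm with a decoder derived from the rules quoted above, and prints the encoded length of a few values. The decoder is a sketch written for this illustration and is not claimed to be the body of Hadoop's readVLong(). The class name VLongDemo and the helpers encode() and decode() are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

class VLongDemo {
  /** Condensed form of the writeVLong() algorithm described in the article. */
  static void writeVLong(DataOutput out, long i) throws IOException {
    if (i >= -112 && i <= 127) { out.writeByte((byte) i); return; }
    int len = -112;
    if (i < 0) { i = ~i; len = -120; }           // one's complement for negatives
    for (long tmp = i; tmp != 0; tmp >>= 8) len--; // count non-zero bytes
    out.writeByte((byte) len);                    // marker: sign and byte count
    len = (len < -120) ? -(len + 120) : -(len + 112);
    for (int idx = len; idx != 0; idx--) {
      int shift = (idx - 1) * 8;
      out.writeByte((byte) ((i >> shift) & 0xFF)); // high byte first
    }
  }

  /** Decoder sketch that inverts the rules above. */
  static long readVLong(DataInput in) throws IOException {
    byte first = in.readByte();
    if (first >= -112) return first;              // single-byte case
    boolean negative = first < -120;              // markers -121..-128
    int len = negative ? -(first + 120) : -(first + 112);
    long value = 0;
    for (int idx = 0; idx < len; idx++) {
      value = (value << 8) | (in.readByte() & 0xFF);
    }
    return negative ? ~value : value;
  }

  static byte[] encode(long value) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    writeVLong(new DataOutputStream(buffer), value);
    return buffer.toByteArray();
  }

  static long decode(byte[] bytes) throws IOException {
    return readVLong(new DataInputStream(new ByteArrayInputStream(bytes)));
  }

  public static void main(String[] args) throws IOException {
    long[] samples = {127, 128, -113, Integer.MAX_VALUE, Long.MAX_VALUE};
    for (long v : samples) {
      byte[] bytes = encode(v);
      // Encoded lengths for the samples, in order: 1, 2, 2, 5, 9 bytes.
      System.out.println(v + " -> " + bytes.length + " byte(s), decoded " + decode(bytes));
    }
  }
}
```

The largest int needs one marker byte plus four value bytes, which is where the upper bound of 5 bytes for VIntWritable comes from.
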

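Implementation of the ObjectWritable class
ObjectWritable can be used to serialize and deserialize the parameters of Hadoop remote procedure calls. Another typical use is a field that must hold objects of different types: for example, when the values of a SequenceFile store different kinds of objects (such as LongWritable or Text), the value can be declared as ObjectWritable.
The implementation of ObjectWritable is fairly long, because it has to handle each kind of object that may be wrapped in an ObjectWritable differently. ObjectWritable has three member variables: the wrapped object instance, the Class object declaredClass describing that object, and a Configuration object.
ObjectWritable's write() method calls the static method ObjectWritable.writeObject(), which can write various kinds of Java objects to a DataOutput.
writeObject() first writes the object's class name and then, depending on the type of the object passed in, serializes the object to the output stream case by case; in other words, the output is the class name followed by the serialized object. Why write the actual class name first? Because of inheritance in Java, the declaredClass passed to ObjectWritable may be the Class object of the instance's own class or of one of its superclasses. When serializing and deserializing, however, the superclass's serialization methods (such as write) usually cannot be used, so the object's actual class name must be recorded.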
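
The idea of writing the class name first and then dispatching on the type can be sketched with the JDK alone. The code below is a heavily simplified illustration and not the ObjectWritable source: it handles only three types, uses writeUTF() for the class name, and the names TypeTaggedDemo and roundTrip() are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

class TypeTaggedDemo {
  /** Writes the runtime class name, then the value in a type-specific format. */
  static void writeObject(DataOutput out, Object instance) throws IOException {
    out.writeUTF(instance.getClass().getName());
    if (instance instanceof Integer) {
      out.writeInt((Integer) instance);
    } else if (instance instanceof Long) {
      out.writeLong((Long) instance);
    } else if (instance instanceof String) {
      out.writeUTF((String) instance);
    } else {
      throw new IOException("unsupported type: " + instance.getClass().getName());
    }
  }

  /** Reads the class name first and uses it to decide how to read the value. */
  static Object readObject(DataInput in) throws IOException {
    String className = in.readUTF();
    switch (className) {
      case "java.lang.Integer": return in.readInt();
      case "java.lang.Long":    return in.readLong();
      case "java.lang.String":  return in.readUTF();
      default: throw new IOException("unsupported type: " + className);
    }
  }

  static Object roundTrip(Object instance) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    writeObject(new DataOutputStream(buffer), instance);
    return readObject(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
  }

  public static void main(String[] args) throws IOException {
    // The reader does not know in advance which type comes next in the stream.
    for (Object o : new Object[] {42, 7L, "text"}) {
      Object back = roundTrip(o);
      System.out.println(back.getClass().getSimpleName() + " " + back);
    }
  }
}
```

Without the class name in the stream, the reader could not tell whether the next bytes are an int, a long or a string.
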

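The WritableFactories class lets non-public Writable subclasses define an object factory; Writable objects are then created by that factory through the static method WritableFactories.newInstance(). The relevant code is as follows: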

public class WritableFactories {
  private static final HashMap<Class, WritableFactory> CLASS_TO_FACTORY =
    new HashMap<Class, WritableFactory>();
…

  /** Create a new instance of a class with a defined factory. */
  public static Writable newInstance(Class<? extends Writable> c, Configuration conf) {
    WritableFactory factory = WritableFactories.getFactory(c);
    if (factory != null) {
      Writable result = factory.newInstance();
      if (result instanceof Configurable) {
        ((Configurable) result).setConf(conf);
      }
      return result;
    } else {
      return ReflectionUtils.newInstance(c, conf);
    }
  }
…
}

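WritableFactories provides a registration mechanism that lets these Writable subclasses register their factory in its static member variable CLASS_TO_FACTORY. The following is a typical WritableFactory implementation.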

public class Block implements Writable, Comparable<Block> {

  static {                                      // register a ctor
    WritableFactories.setFactory
      (Block.class,
       new WritableFactory() {
         public Writable newInstance() { return new Block(); }
       });
  }
…
}
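
The register-then-look-up pattern above can be tried out in isolation. The following sketch imitates it with JDK types only and makes no claim to match the Hadoop source: Record stands in for Writable, Supplier stands in for WritableFactory, and all names (FactoryRegistryDemo, Record, Plain, createdBy) are hypothetical. Unlike the listing above, the sketch explicitly forces class initialization before the lookup, because a class literal such as Block.class alone does not run the static block that registers the factory.

```java
import java.lang.reflect.Constructor;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

class FactoryRegistryDemo {
  /** Stand-in for Writable. */
  interface Record {}

  private static final Map<Class<?>, Supplier<? extends Record>> CLASS_TO_FACTORY =
      new ConcurrentHashMap<>();

  /** Registers a factory for a class, typically from that class's static initializer. */
  static <T extends Record> void setFactory(Class<T> c, Supplier<T> factory) {
    CLASS_TO_FACTORY.put(c, factory);
  }

  /** Uses the registered factory if there is one, otherwise falls back to reflection. */
  static Record newInstance(Class<? extends Record> c) throws ReflectiveOperationException {
    // Force class initialization so that a static registration block has run.
    Class.forName(c.getName(), true, c.getClassLoader());
    Supplier<? extends Record> factory = CLASS_TO_FACTORY.get(c);
    if (factory != null) {
      return factory.get();
    }
    Constructor<? extends Record> ctor = c.getDeclaredConstructor();
    ctor.setAccessible(true);
    return ctor.newInstance();
  }

  /** Registers its own factory, in the style of the Block listing above. */
  static class Block implements Record {
    static {
      setFactory(Block.class, () -> new Block("factory"));
    }
    final String createdBy;
    private Block() { this("reflection"); }
    private Block(String createdBy) { this.createdBy = createdBy; }
  }

  /** Has no registered factory, so it is created reflectively. */
  static class Plain implements Record {
    final String createdBy = "reflection";
  }

  public static void main(String[] args) throws Exception {
    Block block = (Block) newInstance(Block.class);
    Plain plain = (Plain) newInstance(Plain.class);
    System.out.println(block.createdBy + " " + plain.createdBy); // prints "factory reflection"
  }
}
```
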