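Hadoop 1.2.1 Source Code Analysis 2 (Hadoop IO Module)

Both MapReduce and HDFS in Hadoop need to communicate across processes, so the objects they exchange must be serialized. Hadoop does not use Java's built-in serialization; it introduces its own serialization system instead.

The org.apache.hadoop.io package defines a large number of serializable objects, all of which implement the Writable interface. Writable is the common interface for serializable objects.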
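One commonly cited reason for a custom scheme is compactness. The sketch below uses only the JDK and compares the size of a small object written with java.io.ObjectOutputStream against the size of its two fields written directly, which is what a Writable does. The class SerializationSizeDemo and its nested Point class are hypothetical names used only for this illustration; they are not part of Hadoop.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of Hadoop.
final class SerializationSizeDemo {

    /** A plain Java-serializable record with one int and one long. */
    static final class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        final int counter;
        final long timestamp;

        Point(int counter, long timestamp) {
            this.counter = counter;
            this.timestamp = timestamp;
        }
    }

    /** Size in bytes when the object is written with ObjectOutputStream. */
    static int javaSerializedSize(Point p) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    /** Size in bytes when only the two fields are written, Writable-style. */
    static int fieldOnlySize(Point p) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(p.counter);      // 4 bytes
            out.writeLong(p.timestamp);   // 8 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        Point p = new Point(7, 1234567890L);
        System.out.println("Java serialization: " + javaSerializedSize(p) + " bytes");
        System.out.println("Fields only:        " + fieldOnlySize(p) + " bytes");
    }
}
```

The field-only form carries no class metadata, so the reader has to know in advance which type it is reading.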

1 Data type interfaces

1.1 The Writable interface

Any class that implements the Writable interface can be serialized and deserialized. A typical example:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class MyWritable implements Writable {
  // Some data
  private int counter;
  private long timestamp;

  // Serialization method of the Writable interface
  public void write(DataOutput out) throws IOException {
    out.writeInt(counter);
    out.writeLong(timestamp);
  }

  // Deserialization method of the Writable interface
  public void readFields(DataInput in) throws IOException {
    counter = in.readInt();
    timestamp = in.readLong();
  }

  public static MyWritable read(DataInput in) throws IOException {
    MyWritable w = new MyWritable();
    w.readFields(in);
    return w;
  }
}
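To try the write/readFields pattern without a Hadoop dependency, here is a minimal sketch that uses only the JDK. SimpleWritable is a hypothetical stand-in that declares the same two methods as org.apache.hadoop.io.Writable; Event, serialize and deserialize are likewise names invented for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of Hadoop.
final class WritableRoundTrip {

    /** Stand-in with the same two methods as org.apache.hadoop.io.Writable. */
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    static final class Event implements SimpleWritable {
        int counter;
        long timestamp;

        Event() {}

        Event(int counter, long timestamp) {
            this.counter = counter;
            this.timestamp = timestamp;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(counter);
            out.writeLong(timestamp);
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // Fields must be read in exactly the order they were written.
            counter = in.readInt();
            timestamp = in.readLong();
        }
    }

    static byte[] serialize(SimpleWritable w) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static void deserialize(SimpleWritable w, byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            w.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = serialize(new Event(3, 99L));
        Event copy = new Event();
        deserialize(copy, data);
        System.out.println(data.length + " bytes, counter=" + copy.counter
                + ", timestamp=" + copy.timestamp);
    }
}
```

Note that the receiving object is created first and then filled in by readFields, which lets the same instance be reused for many records.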


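1.2 The Comparable interface

Comparable is a JDK interface in the java.lang package. Any object that implements Comparable can be compared with other objects of the same type. The interface has a single method: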

public interface Comparable<T> {
    public int compareTo(T o);
}


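The method compares this object with the argument o. It returns a negative number if this is less than o, 0 if they are equal, and a positive number if this is greater than o.

Note: how this interface differs from java.util.Comparator

(1) If the designer of a class did not consider ordering and did not implement Comparable, a Comparator can provide the ordering without changing the class itself.

(2) Comparable defines the default ordering; Comparator defines a custom ordering. A Comparator is very useful when the default ordering does not fit or when no default ordering exists.

(3) Several sort criteria can coexist, for example ascending and descending.

(4) The sort methods in Arrays and Collections use the default ordering (Comparable) when no Comparator is given, and the supplied comparator otherwise:
Arrays.sort(Object[]) --> every element must implement Comparable, which determines the order of the objects
Arrays.sort(Object[], Comparator) --> the elements need not implement Comparable; the Comparator determines the order of the objects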
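The difference between the two interfaces can be sketched with a small JDK-only example. The Version class and the two helper methods are hypothetical and exist only for this illustration: the class supplies a natural ordering through Comparable, and a separate Comparator supplies a different ordering without touching the class.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical demo class for illustration only.
final class CompareDemo {

    /** Natural ordering: by major, then by minor. */
    static final class Version implements Comparable<Version> {
        final int major;
        final int minor;

        Version(int major, int minor) {
            this.major = major;
            this.minor = minor;
        }

        @Override
        public int compareTo(Version other) {
            if (major != other.major) {
                return Integer.compare(major, other.major);
            }
            return Integer.compare(minor, other.minor);
        }

        @Override
        public String toString() {
            return major + "." + minor;
        }
    }

    /** Sorts a copy using the Comparable implementation of Version. */
    static String sortNatural(Version... versions) {
        Version[] copy = versions.clone();
        Arrays.sort(copy);
        return Arrays.toString(copy);
    }

    /** Sorts a copy with an external Comparator: minor version, descending. */
    static String sortByMinorDescending(Version... versions) {
        Comparator<Version> byMinorDescending = (a, b) -> Integer.compare(b.minor, a.minor);
        Version[] copy = versions.clone();
        Arrays.sort(copy, byMinorDescending);
        return Arrays.toString(copy);
    }

    public static void main(String[] args) {
        Version[] vs = { new Version(2, 0), new Version(1, 5), new Version(1, 2) };
        System.out.println(sortNatural(vs));           // [1.2, 1.5, 2.0]
        System.out.println(sortByMinorDescending(vs)); // [1.5, 1.2, 2.0]
    }
}
```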

1.3 The WritableComparable interface

This interface extends both Writable and Comparable:

// Source of the WritableComparable interface
public interface WritableComparable<T> extends Writable, Comparable<T> {}

// Demo class that implements WritableComparable
public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
  // Some data
  private int counter;
  private long timestamp;

  public void write(DataOutput out) throws IOException {
    out.writeInt(counter);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException {
    counter = in.readInt();
    timestamp = in.readLong();
  }

  // Order instances by their counter field
  public int compareTo(MyWritableComparable w) {
    int thisValue = this.counter;
    int thatValue = w.counter;
    return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
  }
}


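1.4 The RawComparator interface

In MapReduce, comparing values is very important, for example comparing keys with each other during the sort phase. The RawComparator interface exists to optimize this step: an implementation can compare records directly in the byte stream, without deserializing them first.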

package org.apache.hadoop.io;

import java.util.Comparator;

import org.apache.hadoop.io.serializer.DeserializerComparator;

// Note: this extends java.util.Comparator, not java.lang.Comparable
public interface RawComparator<T> extends Comparator<T> {

  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

}


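This interface supports comparison at the byte level, which avoids the cost of deserializing records just to compare them.

The main implementation of this interface is WritableComparator. In most cases, a class that implements WritableComparable defines a nested subclass of WritableComparator that compares its serialized bytes.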
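The idea of comparing records without deserializing them can be sketched with the JDK alone. In the example below, several ints are written back to back into one buffer, and two records are compared by decoding the four bytes at their offsets, with no object created per record. RawIntCompare and its methods are hypothetical names for this illustration and only mirror the shape of the compare(byte[], int, int, byte[], int, int) method; they are not Hadoop APIs.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of Hadoop.
final class RawIntCompare {

    /** Serializes the given ints back to back, 4 big-endian bytes each. */
    static byte[] serializeAll(int... values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int v : values) {
                out.writeInt(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Decodes a big-endian int starting at the given offset. */
    static int readInt(byte[] b, int start) {
        return ((b[start] & 0xff) << 24)
             | ((b[start + 1] & 0xff) << 16)
             | ((b[start + 2] & 0xff) << 8)
             |  (b[start + 3] & 0xff);
    }

    /** Compares the int records that start at s1 in b1 and at s2 in b2. */
    static int compareRaw(byte[] b1, int s1, byte[] b2, int s2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    public static void main(String[] args) {
        byte[] buf = serializeAll(188, -68, 188);
        System.out.println(compareRaw(buf, 0, buf, 4)); // positive: 188 > -68
        System.out.println(compareRaw(buf, 0, buf, 8)); // 0: both records are 188
    }
}
```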

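1.5 The WritableComparator class

The WritableComparator class acts like a registry: it keeps track of the registered Comparator classes, which are the nested classes of WritableComparable implementations such as IntWritable, each of them extending WritableComparator.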

package org.apache.hadoop.io;

import java.io.*;
import java.util.*;
import org.apache.hadoop.util.ReflectionUtils;

public class WritableComparator implements RawComparator {

  /** Acts as the registry: maps each key class to its registered WritableComparator. */
  private static HashMap<Class, WritableComparator> comparators = new HashMap<Class, WritableComparator>();

  /**
   * HashMap is not thread-safe, so this method is synchronized.
   * It looks up the WritableComparator registered for the given
   * Class<? extends WritableComparable>. If none is registered (null),
   * it calls the protected WritableComparator constructor with
   * createInstances = true, which initializes key1 and key2 through
   * newKey() and buffer through new DataInputBuffer().
   */
  public static synchronized WritableComparator get(Class<? extends WritableComparable> c) {
    WritableComparator comparator = comparators.get(c);
    if (comparator == null)
      comparator = new WritableComparator(c, true);
    return comparator;
  }

  /** Registers a WritableComparator in the registry; must be synchronized as well. */
  public static synchronized void define(Class c, WritableComparator comparator) {
    comparators.put(c, comparator);
  }

  /** The type of the keys being compared. */
  private final Class<? extends WritableComparable> keyClass;

  /** The two keys to compare. */
  private final WritableComparable key1;
  private final WritableComparable key2;

  /** Input buffer the raw bytes are loaded into. */
  private final DataInputBuffer buffer;

  /** Construct for a WritableComparable implementation. */
  protected WritableComparator(Class<? extends WritableComparable> keyClass) {
    this(keyClass, false);
  }

  /**
   * keyClass is always set. key1, key2 and buffer (the input buffer used
   * to deserialize the raw bytes) are created only when createInstances
   * is true; otherwise they stay null.
   */
  protected WritableComparator(Class<? extends WritableComparable> keyClass, boolean createInstances) {
    this.keyClass = keyClass;
    if (createInstances) {
      key1 = newKey();
      key2 = newKey();
      buffer = new DataInputBuffer();
    } else {
      key1 = key2 = null;
      buffer = null;
    }
  }

  /** Construct a new {@link WritableComparable} instance. */
  public WritableComparable newKey() {
    return ReflectionUtils.newInstance(keyClass, null);
  }

  /** Returns the WritableComparable implementation class. */
  public Class<? extends WritableComparable> getKeyClass() { return keyClass; }

  /**
   * Uses buffer as a bridge: each byte range is loaded into buffer, then
   * readFields(DataInput in) of key1 and key2 (WritableComparable
   * implementations) deserializes it, and finally key1 and key2 are compared.
   * In other words, this default implementation deserializes the binary data
   * into objects before comparing them.
   */
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    try {
      buffer.reset(b1, s1, l1);                   // parse key1
      key1.readFields(buffer);

      buffer.reset(b2, s2, l2);                   // parse key2
      key2.readFields(buffer);

    } catch (IOException e) {
      throw new RuntimeException(e);
    }

    return compare(key1, key2);                   // compare them
  }

  @SuppressWarnings("unchecked")
  public int compare(WritableComparable a, WritableComparable b) {
    return a.compareTo(b);
  }

  public int compare(Object a, Object b) {
    return compare((WritableComparable)a, (WritableComparable)b);
  }

  /** Compares two byte ranges directly, byte by byte, treating each byte as unsigned. */
  public static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    int end1 = s1 + l1;
    int end2 = s2 + l2;
    for (int i = s1, j = s2; i < end1 && j < end2; i++, j++) {
      int a = (b1[i] & 0xff);
      int b = (b2[j] & 0xff);
      if (a != b) {
        return a - b;
      }
    }
    return l1 - l2;
  }

  /** Compute hash for binary data. */
  public static int hashBytes(byte[] bytes, int offset, int length) {
    int hash = 1;
    for (int i = offset; i < offset + length; i++)
      hash = (31 * hash) + (int)bytes[i];
    return hash;
  }

  /** Compute hash for binary data. */
  public static int hashBytes(byte[] bytes, int length) {
    return hashBytes(bytes, 0, length);
  }

  /**
   * readUnsignedShort, readInt, readFloat, readLong, readDouble, readVLong,
   * readVInt and so on are helpers for the comparators of concrete
   * WritableComparable types. Take IntWritable: its nested Comparator class
   * overrides the compare() method of its parent WritableComparator to match
   * the IntWritable encoding. The compare() method in WritableComparator is
   * only a default implementation; the real comparison is supplied by
   * overriding it per type. The read* methods are low-level building blocks
   * that those nested Comparator classes call.
   */

  /** Parse an unsigned short from a byte array. */
  public static int readUnsignedShort(byte[] bytes, int start) {
    return (((bytes[start]   & 0xff) <<  8) + ((bytes[start+1] & 0xff)));
  }

  /** Parse an integer from a byte array. */
  public static int readInt(byte[] bytes, int start) {
    return ( ((bytes[start  ] & 0xff) << 24) + ((bytes[start+1] & 0xff) << 16) +
                     ((bytes[start+2] & 0xff) <<  8) + ((bytes[start+3] & 0xff)) );
  }

  /** Parse a float from a byte array. */
  public static float readFloat(byte[] bytes, int start) {
    return Float.intBitsToFloat(readInt(bytes, start));
  }

  /** Parse a long from a byte array. */
  public static long readLong(byte[] bytes, int start) {
    return ((long)(readInt(bytes, start)) << 32) +(readInt(bytes, start+4) & 0xFFFFFFFFL);
  }

  /** Parse a double from a byte array. */
  public static double readDouble(byte[] bytes, int start) {
    return Double.longBitsToDouble(readLong(bytes, start));
  }

  /**
   * Reads a zero-compressed encoded long from a byte array and returns it.
   * @param bytes byte array with decode long
   * @param start starting index
   * @throws java.io.IOException
   * @return deserialized long
   */
  public static long readVLong(byte[] bytes, int start) throws IOException {
    int len = bytes[start];
    if (len >= -112) {
      return len;
    }
    boolean isNegative = (len < -120);
    len = isNegative ? -(len + 120) : -(len + 112);
    if (start+1+len>bytes.length)
      throw new IOException(
                            "Not enough number of bytes for a zero-compressed integer");
    long i = 0;
    for (int idx = 0; idx < len; idx++) {
      i = i << 8;
      i = i | (bytes[start+1+idx] & 0xFF);
    }
    return (isNegative ? (i ^ -1L) : i);
  }

  /**
   * Reads a zero-compressed encoded integer from a byte array and returns it.
   * @param bytes byte array with the encoded integer
   * @param start start index
   * @throws java.io.IOException
   * @return deserialized integer
   */
  public static int readVInt(byte[] bytes, int start) throws IOException {
    return (int) readVLong(bytes, start);
  }
}
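The compareBytes method above orders byte ranges lexicographically and treats every byte as unsigned. The JDK-only sketch below copies that loop into a hypothetical CompareBytesDemo class (the class and its intBytes helper are assumptions made for this illustration) to show two consequences: a byte such as 0x80 sorts after 0x01, and a plain byte comparison of two's-complement ints gives the wrong order for negative numbers, which suggests why a type-specific comparator decodes the int instead.

```java
import java.nio.ByteBuffer;

// Hypothetical demo class; compareBytes repeats the loop shown in the listing above.
final class CompareBytesDemo {

    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int end1 = s1 + l1;
        int end2 = s2 + l2;
        for (int i = s1, j = s2; i < end1 && j < end2; i++, j++) {
            int a = b1[i] & 0xff;   // widen as unsigned: 0..255
            int b = b2[j] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return l1 - l2;             // equal prefix: the shorter range sorts first
    }

    /** Four big-endian bytes of v (ByteBuffer is big-endian by default). */
    static byte[] intBytes(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    public static void main(String[] args) {
        // Unsigned view: 0x01 < 0x80, although (byte) 0x80 is -128 as a signed byte.
        System.out.println(compareBytes(new byte[] {1}, 0, 1, new byte[] {(byte) 0x80}, 0, 1) < 0);

        // A shared prefix is broken by length.
        System.out.println(compareBytes(new byte[] {1, 2}, 0, 2, new byte[] {1, 2, 3}, 0, 3) < 0);

        // 188 is 00 00 00 bc and -68 is ff ff ff bc, so the byte order says 188 < -68.
        System.out.println(compareBytes(intBytes(188), 0, 4, intBytes(-68), 0, 4) < 0);
    }
}
```

All three lines print true. The last one is the reason numeric types cannot simply reuse compareBytes on their serialized form.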


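2 Basic data types

Hadoop's org.apache.hadoop.io package ships a wide range of Writable classes to choose from. Their class hierarchy is shown in the figure below:

[Figure: class hierarchy of the Writable classes in org.apache.hadoop.io]

Hadoop provides serializable counterparts for the Java primitive types, such as IntWritable, BooleanWritable, ByteWritable, DoubleWritable, FloatWritable and LongWritable, as well as NullWritable and Text. All of these classes implement the WritableComparable interface, so their values can be serialized, deserialized and compared.

2.1 IntWritable: the integer type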

package org.apache.hadoop.io;

import java.io.*;

public class IntWritable implements WritableComparable {

  /** The int value wrapped by this IntWritable */
  private int value;

  public IntWritable() {}

  public IntWritable(int value) { set(value); }

  /** Set the value of this IntWritable. */
  public void set(int value) { this.value = value; }

  /** Return the value of this IntWritable. */
  public int get() { return value; }

  /** Serialization and deserialization methods of the Writable parent interface */
  public void readFields(DataInput in) throws IOException {
    value = in.readInt();
  }
  public void write(DataOutput out) throws IOException {
    out.writeInt(value);
  }

  /** equals, hashCode and compareTo are overridden to compare IntWritable values */
  /** Returns true if o is a IntWritable with the same value. */
  public boolean equals(Object o) {
    if (!(o instanceof IntWritable))
      return false;
    IntWritable other = (IntWritable)o;
    return this.value == other.value;
  }
  public int hashCode() {
    return value;
  }
  /** Compares two IntWritables. */
  public int compareTo(Object o) {
    int thisValue = this.value;
    int thatValue = ((IntWritable)o).value;
    return (thisValue<thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
  }

  public String toString() {
    return Integer.toString(value);
  }

  /** The optimized comparator nested in IntWritable. */
  public static class Comparator extends WritableComparator {
    public Comparator() {
      super(IntWritable.class);
    }

    /** Overrides compare() of the parent class WritableComparator */
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      int thisValue = readInt(b1, s1);
      int thatValue = readInt(b2, s2);
      return (thisValue<thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
    }
  }

  // Register the comparator for this type with WritableComparator
  static {
    WritableComparator.define(IntWritable.class, new Comparator());
  }
}


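BooleanWritable, ByteWritable, FloatWritable, LongWritable and the other basic types are implemented much like IntWritable.

The test demos below are partly based on "Hadoop: The Definitive Guide":

(1) The test base class WritableTestBase, which provides basic serialization and deserialization helpers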

package com.iresearch.hadoop.base;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.StringUtils;

public class WritableTestBase {

    /**
     * Serializes an object that implements org.apache.hadoop.io.Writable into a byte array.
     *
     * @param writable the object to serialize
     * @return the serialized bytes
     * @throws java.io.IOException
     */
    public static byte[] serialize(Writable writable) throws IOException {

        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        DataOutputStream dataOutputStream = new DataOutputStream(byteArrayOutputStream);

        writable.write(dataOutputStream);
        dataOutputStream.close();

        return byteArrayOutputStream.toByteArray();
    }

    /**
     * Fills an object that implements org.apache.hadoop.io.Writable from a byte array.
     *
     * @param writable the object to populate
     * @param bytes    the serialized bytes to read from
     * @return the same byte array that was passed in
     * @throws java.io.IOException
     */
    public static byte[] deserialize(Writable writable, byte[] bytes) throws IOException {

        ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(bytes);
        DataInputStream dataInputStream = new DataInputStream(byteArrayInputStream);

        writable.readFields(dataInputStream);
        dataInputStream.close();

        return bytes;
    }

    /**
     * Serializes an object that implements org.apache.hadoop.io.Writable and
     * returns the bytes as a hexadecimal string.
     *
     * @param writable the object to serialize
     * @return the serialized bytes as a hex string
     * @throws java.io.IOException
     */
    public static String serializeToHexString(Writable writable) throws IOException{
        return StringUtils.byteToHexString( serialize(writable) );
    }

    /**
     * Copies the data of one Writable into another Writable by serializing
     * the source and deserializing into the destination.
     *
     * @param src  the source object
     * @param dest the destination object to write into
     * @return the transferred bytes as a hex string
     * @throws java.io.IOException
     */
    public static String writeTo(Writable src, Writable dest) throws IOException{
        byte[] bytes = deserialize(dest, serialize(src));
        return StringUtils.byteToHexString( bytes );
    }
}


(2) IntWritable test demo

package com.iresearch.hadoop.io;

import static org.hamcrest.CoreMatchers.is;
import static org.hamcrest.Matchers.greaterThan;
import static org.hamcrest.Matchers.equalTo;
import static org.junit.Assert.assertThat;

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.WritableComparator;
import org.junit.Test;

import com.iresearch.hadoop.base.WritableTestBase;


public class IntWritableTest extends WritableTestBase {

    @Test
    public void testByte(){
        int a = 5;
        System.out.println(Integer.toBinaryString(a & 0xFF));
        System.out.println(Integer.toBinaryString(-a & 0xFF));

        System.out.println(0xFF);
        // Conclusion: 0xFF is the int 00000000 00000000 00000000 11111111 (255), not -1
        int b = -5;
        b ^= -1L;
        System.out.println(b);
    }


    /**
     * Checks how many bytes a serialized IntWritable occupies
     */
    @Test
    public void checkIntWritableLength() throws IOException{
        IntWritable writable = new IntWritable(188);      // 00000000 00000000 00000000 10111100
        // Serialized bytes: [0, 0, 0, -68]. The low byte 1011 1100 is negative as a signed byte:
        // subtract 1 -> 1011 1011, invert -> 0100 0100 = 2^6 + 2^2 = 68, hence -68.
        byte[] data = serialize(writable);
        assertThat(data.length, is(4));                   // an IntWritable occupies four bytes
        System.out.println(Arrays.toString(data));        // [0, 0, 0, -68]
    }

    /**
     * Prints the serialized binary data as a hexadecimal string
     */
    @Test
    public void checkBytesToString() throws IOException{
        IntWritable writable = new IntWritable(188);
        String bytesStr = serializeToHexString(writable);
        // 00000000 00000000 00000000 10111100  --> 00 00 00 bc
        assertThat(bytesStr, is("000000bc"));
        System.out.println(bytesStr);
    }

    /**
     * Tests deserialization
     */
    @Test
    public void checkDeserialize() throws IOException{
        IntWritable writable = new IntWritable(188);
        byte[] bytes = serialize(writable);

        IntWritable deseriaWritable = new IntWritable();
        deserialize(deseriaWritable, bytes);
        assertThat(deseriaWritable.get(), is(188));
        System.out.println(deseriaWritable.get());
    }

    /***
     * Tests the WritableComparator registered for IntWritable
     *
     * @throws IOException
     */
    @Test
    public void checkIntWritableComparator() throws IOException{

        @SuppressWarnings("unchecked")
        RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);

        IntWritable writableA = new IntWritable(188);
        IntWritable writableB = new IntWritable(-68);

        // Comparison between IntWritable objects
        int compare = comparator.compare(writableA, writableB);
        assertThat( compare, greaterThan(0) );
        System.out.println(compare);

        // Direct comparison of the serialized bytes of two IntWritable objects
        writableB.set(188);
        byte[] bytesA = serialize(writableA);
        byte[] bytesB = serialize(writableB);
        compare = comparator.compare(bytesA, 0, bytesA.length, bytesB, 0, bytesB.length);
        assertThat( compare, equalTo(0) );
        System.out.println(compare);
    }
}
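The [0, 0, 0, -68] output in the test above follows from Java's byte type being signed. A minimal JDK-only sketch of the conversion in both directions is shown below; SignedByteDemo and toUnsigned are hypothetical names for this illustration.

```java
// Hypothetical demo class for illustration only.
final class SignedByteDemo {

    /** Widens a byte to int without sign extension (result is 0..255). */
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte low = (byte) 188;                    // bit pattern 1011 1100
        System.out.println(low);                  // -68: the top bit is the sign bit
        System.out.println(toUnsigned(low));      // 188: masking restores the unsigned view
        System.out.println(Integer.toBinaryString(-5 & 0xFF)); // 11111011
    }
}
```

The same `& 0xFF` mask appears throughout the read helpers of WritableComparator, for the same reason.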

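
2.2 VIntWritable: the variable-length integer type

VIntWritable and VLongWritable have almost identical source code, and VIntWritable encodes and decodes its value with the same routine that VLongWritable uses. The main difference is that VIntWritable holds an int value while VLongWritable holds a long value, which follows from their value ranges. Unlike the other basic types, neither class defines a nested Comparator class.

The serialized sizes (in bytes) of the basic types are listed below: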

Java primitive type	Writable implementation	Serialized size (bytes)
boolean	BooleanWritable	1
byte	ByteWritable	1
int	IntWritable	4
int	VIntWritable	1~5
float	FloatWritable	4
long	LongWritable	8
long	VLongWritable	1~9
double	DoubleWritable	8
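To see where the 1~5 and 1~9 ranges come from, the encoded size of a given value can be computed directly. The sketch below is a JDK-only approximation under the encoding rules described in this section (one marker byte plus as many value bytes as the magnitude needs); VIntSize and encodedSize are hypothetical names, not Hadoop APIs.

```java
// Hypothetical demo class for illustration only.
final class VIntSize {

    /** Number of bytes the variable-length encoding needs for i (1 to 9). */
    static int encodedSize(long i) {
        if (i >= -112 && i <= 127) {
            return 1;                       // the single byte is the value itself
        }
        if (i < 0) {
            i ^= -1L;                       // bitwise complement, now non-negative
        }
        // Significant bits of the magnitude, rounded up to whole bytes,
        // plus one marker byte for sign and length.
        int dataBits = Long.SIZE - Long.numberOfLeadingZeros(i);
        return 1 + (dataBits + 7) / 8;
    }

    public static void main(String[] args) {
        long[] samples = { 1, 127, 128, -112, -113, -259,
                           Integer.MAX_VALUE, Integer.MIN_VALUE,
                           Long.MAX_VALUE, Long.MIN_VALUE };
        for (long v : samples) {
            System.out.println(v + " -> " + encodedSize(v) + " byte(s)");
        }
    }
}
```

Small magnitudes dominate many real data sets (counts, offsets, lengths), which is when the variable-length form saves space over the fixed 4 or 8 bytes.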

package org.apache.hadoop.io;

import java.io.*;

/**
 *  A WritableComparable for longs in a variable-length format.
 *  Such values take between one and five bytes.  Smaller values
 *  take fewer bytes.
 *  Note: a vlong can take up to nine bytes, as the size table above shows.
 */
public class VLongWritable implements WritableComparable {
  private long value;

  public VLongWritable() {}

  public VLongWritable(long value) { set(value); }

  /** Set the value of this LongWritable. */
  public void set(long value) { this.value = value; }

  /** Return the value of this LongWritable. */
  public long get() { return value; }

  public void readFields(DataInput in) throws IOException {
    value = WritableUtils.readVLong(in);
  }

  public void write(DataOutput out) throws IOException {
    WritableUtils.writeVLong(out, value);
  }

  /** Returns true if o is a VLongWritable with the same value. */
  public boolean equals(Object o) {
    if (!(o instanceof VLongWritable))
      return false;
    VLongWritable other = (VLongWritable)o;
    return this.value == other.value;
  }

  public int hashCode() {
    return (int)value;
  }

  /** Compares two VLongWritables. */
  public int compareTo(Object o) {
    long thisValue = this.value;
    long thatValue = ((VLongWritable)o).value;
    return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
  }

  public String toString() {
    return Long.toString(value);
  }

}

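
As shown above, VLongWritable.write encodes its value with WritableUtils.writeVLong(DataOutput stream, long i). VIntWritable.write calls WritableUtils.writeVInt(DataOutput stream, int i) instead. WritableUtils is a utility class for encoding and decoding, and writeVInt simply delegates to WritableUtils.writeVLong(stream, i):

  public static void writeVInt(DataOutput stream, int i) throws IOException {
    writeVLong(stream, i);
  }

A serialized VIntWritable takes 1 to 5 bytes and a serialized VLongWritable takes 1 to 9 bytes. A value in [-112, 127] is stored in a single byte (8 bits) that holds the value itself. Outside that range more bytes are needed: the first byte stores the sign and the number of value bytes, and the remaining bytes store the value.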

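First byte (Dec)	Binary	Hex	Meaning
-128	1000 0000	80	negative value, 8 value bytes follow
-127	1000 0001	81	negative value, 7 value bytes follow
-126	1000 0010	82	negative value, 6 value bytes follow
-125	1000 0011	83	negative value, 5 value bytes follow
-124	1000 0100	84	negative value, 4 value bytes follow
-123	1000 0101	85	negative value, 3 value bytes follow
-122	1000 0110	86	negative value, 2 value bytes follow
-121	1000 0111	87	negative value, 1 value byte follows
-120	1000 1000	88	positive value, 8 value bytes follow
-119	1000 1001	89	positive value, 7 value bytes follow
-118	1000 1010	8A	positive value, 6 value bytes follow
-117	1000 1011	8B	positive value, 5 value bytes follow
-116	1000 1100	8C	positive value, 4 value bytes follow
-115	1000 1101	8D	positive value, 3 value bytes follow
-114	1000 1110	8E	positive value, 2 value bytes follow
-113	1000 1111	8F	positive value, 1 value byte follows
[-112, 127]			the byte itself is the value (1 byte in total)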

The annotated source of WritableUtils.writeVLong(DataOutput stream, long i):

public static void writeVLong(DataOutput stream, long i) throws IOException {

    if (i >= -112 && i <= 127) {  // values in this range are stored in a single byte
        stream.writeByte((byte)i);
        return;
    }

    int len = -112; // assume i is positive: the length marker counts down from -112

    if (i < 0) {    // i is negative
        i ^= -1L;   // XOR with ...1111 1111, i.e. take the bitwise complement
        len = -120; // for negative values the length marker counts down from -120
    }

    long tmp = i;
    /*
     * At this point a positive i is unchanged, and a negative i has been
     * replaced by its bitwise complement, for example:
     *
     *      -158:     11111111 01100010
     *        -1:   ^ 11111111 11111111
     *      ----------------------------------------
     *                00000000 10011101   ==> 157
     */
    while (tmp != 0) {
        tmp = tmp >> 8;
        len--;           // each right shift by 8 bits (1 byte) decrements len, meaning one more value byte; tmp == 0 ends the count
    }

    /*
     * First write the marker byte that encodes the sign of i and the number of value bytes.
     *
     * Positive: -113 means 1 value byte
     *           -114 means 2 value bytes
     *           ...
     *           -119 means 7 value bytes
     *           -120 means 8 value bytes
     *
     * Negative: -121 means 1 value byte
     *           -122 means 2 value bytes
     *           ...
     *           -127 means 7 value bytes
     *           -128 means 8 value bytes
     */
    stream.writeByte((byte)len);

    len = (len < -120) ? -(len + 120) : -(len + 112);  // number of value bytes

    for (int idx = len; idx != 0; idx--) {  // write i from the most significant byte to the least
        int shiftbits = (idx - 1) * 8;
        long mask = 0xFFL << shiftbits;     // mask selects the 8 bits at the current position
        stream.writeByte((byte)((i & mask) >> shiftbits));
    }
}


Now let's see how the variable-length data is read back. The annotated source of WritableUtils.readVLong(DataInput stream):

/**
* Reads a zero-compressed encoded long from input stream and returns it.
* @param stream Binary input stream
* @throws java.io.IOException
* @return deserialized long from stream.
*/
public static long readVLong(DataInput stream) throws IOException {

    byte firstByte = stream.readByte();     // read the first byte
    int len = decodeVIntSize(firstByte);    // total number of bytes, including the marker byte for sign and length
    if (len == 1) {
        return firstByte;
    }
    long i = 0;
    for (int idx = 0; idx < len-1; idx++) { // read the value bytes from the DataInput
        byte b = stream.readByte();
        i = i << 8;
        // A value byte may be negative as a signed byte (e.g. -9 = 1111 0111 in [-123, 23, -9, 5]);
        // masking with 0xFF prevents sign extension to 0xFFFFFF... when it is widened
        i = i | (b & 0xFF);
    }
    return (isNegativeVInt(firstByte) ? (i ^ -1L) : i);
}

/**
* Parse the first byte of a vint/vlong to determine the number of bytes
* @param value the first byte of the vint/vlong
* @return the total number of bytes (1 to 9)
*/
public static int decodeVIntSize(byte value) {
    if (value >= -112) {
        return 1;
    } else if (value < -120) {
        return -119 - value;
    }
    return -111 - value;
}

/**
* Given the first byte of a vint/vlong, determine the sign
* @param value the first byte
* @return is the value negative
*/
public static boolean isNegativeVInt(byte value) {
    return value < -120 || (value >= -112 && value < 0);
}
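The write and read routines above can be exercised end to end without Hadoop. The sketch below re-implements the same encoding rules on plain byte arrays using only the JDK; VLongCodec, encode, decode and hex are hypothetical names for this illustration, and the logic is a condensed restatement of the listings above, not the library code itself.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical demo class restating the encoding rules shown above.
final class VLongCodec {

    static byte[] encode(long i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (i >= -112 && i <= 127) {
            out.write((int) i);                 // single byte holds the value itself
            return out.toByteArray();
        }
        int len = -112;
        if (i < 0) {
            i ^= -1L;                           // bitwise complement, now non-negative
            len = -120;
        }
        for (long tmp = i; tmp != 0; tmp >>= 8) {
            len--;                              // one step per value byte
        }
        out.write(len);                         // marker byte: sign and length
        int valueBytes = (len < -120) ? -(len + 120) : -(len + 112);
        for (int idx = valueBytes; idx != 0; idx--) {
            // write(int) keeps only the low 8 bits of its argument
            out.write((int) (i >> ((idx - 1) * 8)));
        }
        return out.toByteArray();
    }

    static long decode(byte[] b) {
        byte first = b[0];
        if (first >= -112) {
            return first;
        }
        boolean negative = first < -120;
        int valueBytes = negative ? -(first + 120) : -(first + 112);
        long v = 0;
        for (int k = 1; k <= valueBytes; k++) {
            v = (v << 8) | (b[k] & 0xFF);       // mask to avoid sign extension
        }
        return negative ? ~v : v;
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte x : bytes) {
            sb.append(String.format("%02x", x & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (long v : new long[] { 1, 128, -259, Long.MIN_VALUE }) {
            byte[] enc = encode(v);
            System.out.println(v + " -> " + hex(enc) + " -> " + decode(enc));
        }
    }
}
```

For example, -259 becomes 258 (0x0102) after the complement, needs two value bytes, and is therefore written as the marker 0x86 followed by 01 02.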


(1) VIntWritable test demo

package com.iresearch.hadoop.io;

import static org.hamcrest.CoreMatchers.is;
import static org.junit.Assert.assertThat;

import java.io.IOException;

import org.apache.hadoop.io.VIntWritable;
import org.apache.hadoop.io.VLongWritable;
import org.junit.Test;

import com.iresearch.hadoop.base.WritableTestBase;

public class VIntWritableTest extends WritableTestBase {

    @Test
    public void testSerialize() throws IOException{

        VIntWritable vint = new VIntWritable(-259);
        byte[] bytes = serialize(vint);
        System.out.println( serializeToHexString(vint) ); // 860102, 3 bytes (1 marker byte + 2 value bytes)

        VIntWritable vintNew = new VIntWritable();
        deserialize(vintNew, bytes);

        System.out.println( vintNew.get() ); // -259

        System.out.println( serializeToHexString(new VIntWritable(1)) );    // 01, 1 byte
        System.out.println( serializeToHexString(new VIntWritable(-112)) ); // 90, 1 byte
        System.out.println( serializeToHexString(new VIntWritable(127)) );  // 7f, 1 byte
        System.out.println( serializeToHexString(new VIntWritable(128)) );  // 8f80, 2 bytes
        System.out.println( serializeToHexString(new VIntWritable(163)) );  // 8fa3, 2 bytes
        System.out.println( serializeToHexString(new VIntWritable(Integer.MAX_VALUE)) ); // 8c7fffffff, 5 bytes
        System.out.println( serializeToHexString(new VIntWritable(Integer.MIN_VALUE)) ); // 847fffffff, 5 bytes


        assertThat(serializeToHexString(new VLongWritable(1)), is("01")); // 1 byte
        assertThat(serializeToHexString(new VLongWritable(127)), is("7f")); // 1 byte
        assertThat(serializeToHexString(new VLongWritable(128)), is("8f80")); // 2 bytes
        assertThat(serializeToHexString(new VLongWritable(163)), is("8fa3")); // 2 bytes
        assertThat(serializeToHexString(new VLongWritable(Long.MAX_VALUE)), is("887fffffffffffffff")); // 9 bytes
        assertThat(serializeToHexString(new VLongWritable(Long.MIN_VALUE)), is("807fffffffffffffff")); // 9 bytes
    }

}


2.3 Text: the text type

The Text class corresponds to Java's String type. It extends BinaryComparable and implements WritableComparable<BinaryComparable>. Text stores its content in UTF-8 and provides serialization, deserialization and comparison at the byte level.
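Because Text works on UTF-8 bytes, its lengths and positions are byte counts rather than char counts, and the two differ as soon as a string contains non-ASCII characters. The JDK-only sketch below illustrates the difference; Utf8Offsets, byteLength and findByteOffset are hypothetical helpers written for this illustration and are not the Text implementation.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
final class Utf8Offsets {

    /** Length of s in UTF-8 bytes. */
    static int byteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Byte offset of the first occurrence of what in the UTF-8 form of s, or -1. */
    static int findByteOffset(String s, String what) {
        byte[] src = s.getBytes(StandardCharsets.UTF_8);
        byte[] tgt = what.getBytes(StandardCharsets.UTF_8);
        outer:
        for (int i = 0; i + tgt.length <= src.length; i++) {
            for (int j = 0; j < tgt.length; j++) {
                if (src[i + j] != tgt[j]) {
                    continue outer;     // mismatch: try the next start position
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo";                        // U+00E9 takes two bytes in UTF-8
        System.out.println(s.length());                 // 5 chars
        System.out.println(byteLength(s));              // 6 bytes
        System.out.println(s.indexOf("llo"));           // 2 (char index)
        System.out.println(findByteOffset(s, "llo"));   // 3 (byte offset)
    }
}
```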

  1 package org.apache.hadoop.io;
  2 
  3 import java.io.IOException;
  4 import java.io.DataInput;
  5 import java.io.DataOutput;
  6 import java.nio.ByteBuffer;
  7 import java.nio.CharBuffer;
  8 import java.nio.charset.CharacterCodingException;
  9 import java.nio.charset.Charset;
 10 import java.nio.charset.CharsetDecoder;
 11 import java.nio.charset.CharsetEncoder;
 12 import java.nio.charset.CodingErrorAction;
 13 import java.nio.charset.MalformedInputException;
 14 import java.text.CharacterIterator;
 15 import java.text.StringCharacterIterator;
 16 
 17 import org.apache.commons.logging.Log;
 18 import org.apache.commons.logging.LogFactory;
 19 
 20 public class Text extends BinaryComparable implements WritableComparable<BinaryComparable> {
 21   private static final Log LOG= LogFactory.getLog(Text.class);
 22   
 23   private static ThreadLocal<CharsetEncoder> ENCODER_FACTORY =
 24     new ThreadLocal<CharsetEncoder>() {
 25       protected CharsetEncoder initialValue() {
 26         return Charset.forName("UTF-8").newEncoder().
 27                onMalformedInput(CodingErrorAction.REPORT).
 28                onUnmappableCharacter(CodingErrorAction.REPORT);
 29     }
 30   };
 31   
 32   private static ThreadLocal<CharsetDecoder> DECODER_FACTORY =
 33     new ThreadLocal<CharsetDecoder>() {
 34     protected CharsetDecoder initialValue() {
 35       return Charset.forName("UTF-8").newDecoder().
 36              onMalformedInput(CodingErrorAction.REPORT).
 37              onUnmappableCharacter(CodingErrorAction.REPORT);
 38     }
 39   };
 40   
 41   private static final byte [] EMPTY_BYTES = new byte[0];
 42   
 43   private byte[] bytes;
 44   private int length;
 45 
 46   public Text() {
 47     bytes = EMPTY_BYTES;
 48   }
 49 
 50   /** Construct from a string. */
 51   public Text(String string) { set(string); }
 52 
 53   /** Construct from another text. */
 54   public Text(Text utf8) { set(utf8); }
 55 
 56   /** Construct from a byte array. */
 57   public Text(byte[] utf8)  { set(utf8); }
 58   
 59   /** Returns the raw bytes; however, only data up to {@link #getLength()} is valid. */
 60   public byte[] getBytes() { return bytes; }
 61 
 62   /** Returns the number of bytes in the byte array */ 
 63   public int getLength() { return length; }
 64   
 65   /** 返回指定 position 用于表示 Unicode 代码点的int类型，与 String 对象返回一个 char 类型不同 */
 66   public int charAt(int position) {
 67     if (position > this.length) return -1; // too long
 68     if (position < 0) return -1; // duh.
 69       
 70     ByteBuffer bb = (ByteBuffer)ByteBuffer.wrap(bytes).position(position);
 71     return bytesToCodePoint(bb.slice());
 72   }
 73   
 74   public int find(String what) { return find(what, 0); }
 75   
 76   /** 
 77    * find 方法与 String 的indexOf 方法相对应，用于返回某个子串在 Text 对象
 78    * 所封装的字节数组中所出现的第一位置
 79    */
 80   /**
 81    * Finds any occurence of <code>what</code> in the backing
 82    * buffer, starting as position <code>start</code>. The starting
 83    * position is measured in bytes and the return value is in
 84    * terms of byte position in the buffer. The backing buffer is
 85    * not converted to a string for this operation.
 86    * @return byte position of the first occurence of the search
 87    *         string in the UTF-8 buffer or -1 if not found
 88    */
 89   public int find(String what, int start) {
 90     try {
 91       ByteBuffer src = ByteBuffer.wrap(this.bytes,0,this.length);
 92       ByteBuffer tgt = encode(what);
 93       byte b = tgt.get();
 94       src.position(start);
 95           
 96       while (src.hasRemaining()) {
 97         if (b == src.get()) { // matching first byte
 98           src.mark(); // save position in loop
 99           tgt.mark(); // save position in target
100           boolean found = true;
101           int pos = src.position()-1;
102           while (tgt.hasRemaining()) {
103             if (!src.hasRemaining()) { // src expired first
104               tgt.reset();
105               src.reset();
106               found = false;
107               break;
108             }
109             if (!(tgt.get() == src.get())) {
110               tgt.reset();
111               src.reset();
112               found = false;
113               break; // no match
114             }
115           }
116           if (found) return pos;
117         }
118       }
119       return -1; // not found
120     } catch (CharacterCodingException e) {
121       // can't get here
122       e.printStackTrace();
123       return -1;
124     }
125   }  
126   /** Set to contain the contents of a string. */
127   public void set(String string) {
128     try {
129       ByteBuffer bb = encode(string, true);
130       bytes = bb.array();
131       length = bb.limit();
132     }catch(CharacterCodingException e) {
133       throw new RuntimeException("Should not have happened " + e.toString()); 
134     }
135   }
136 
137   /** Set to a utf8 byte array. */
138   public void set(byte[] utf8) { set(utf8, 0, utf8.length); }
139   
140   /** copy a text. */
141   public void set(Text other) { set(other.getBytes(), 0, other.getLength()); }
142   
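143   /** 
144    *  Overloaded set method that initializes the fields of the Text object.
145    *  setCapacity(int len, boolean keepData) makes sure bytes can hold at least len bytes;
146    *  the keepData flag decides whether the existing content of bytes is preserved.
147    *
148    *  System.arraycopy(Object src, int srcPos, Object dest, int destPos, int length)
149    *    src      the source array
150    *    srcPos   index in the source array at which copying starts
151    *    dest     the destination array
152    *    destPos  index in the destination array at which storing starts
153    *    length   number of elements to copy
154    */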
155   /**
156    * Set the Text to range of bytes
157    * @param utf8 the data to copy from
158    * @param start the first position of the new string
159    * @param len the number of bytes of the new string
160    */
161   public void set(byte[] utf8, int start, int len) {
162     setCapacity(len, false);
163     System.arraycopy(utf8, start, bytes, 0, len);
164     this.length = len;
165   }
166   
167   /** Appends a range of bytes to the end of the byte array wrapped by the Text object */
168   /**
169    * Append a range of bytes to the end of the given text
170    * @param utf8 the data to copy from
171    * @param start the first position to append from utf8
172    * @param len the number of bytes to append
173    */
174   public void append(byte[] utf8, int start, int len) {
175     setCapacity(length + len, true);
176     System.arraycopy(utf8, start, bytes, length, len);
177     length += len;
178   }
179 
180   /** Empties the Text by setting its length to 0; the backing array is not released */
181   /** Clear the string to empty. */
182   public void clear() { length = 0; }
183 
184   /*
185    * Sets the capacity of this Text object to <em>at least</em>
186    * <code>len</code> bytes. If the current buffer is longer,
187    * then the capacity and existing content of the buffer are
188    * unchanged. If <code>len</code> is larger
189    * than the current capacity, the Text object's capacity is
190    * increased to match.
191    * @param len the number of bytes we need
192    * @param keepData should the old data be kept
193    */
194   private void setCapacity(int len, boolean keepData) {
195     if (bytes == null || bytes.length < len) {
196       byte[] newBytes = new byte[len];
197       if (bytes != null && keepData) {
198         System.arraycopy(bytes, 0, newBytes, 0, length);
199       }
200       bytes = newBytes;
201     }
202   }
203    
204   /** 
205    * Convert text back to string
206    * @see java.lang.Object#toString()
207    */
208   public String toString() {
209     try {
210       return decode(bytes, 0, length);
211     } catch (CharacterCodingException e) { 
212       throw new RuntimeException("Should not have happened " + e.toString()); 
213     }
214   }
215   
216   /** Serialization and deserialization of the Text object */
217   /** serialize
218    * write this object to out
219    * length uses zero-compressed encoding
220    * @see Writable#write(DataOutput)
221    */
222   public void write(DataOutput out) throws IOException {
223     WritableUtils.writeVInt(out, length);
224     out.write(bytes, 0, length);
225   }
226   
227   /** deserialize */
228   public void readFields(DataInput in) throws IOException {
229     int newLength = WritableUtils.readVInt(in);
230     setCapacity(newLength, false);
231     in.readFully(bytes, 0, newLength);
232     length = newLength;
233   }
234 
235   /** Skips over one Text in the input. */
236   public static void skip(DataInput in) throws IOException {
237     int length = WritableUtils.readVInt(in);
238     WritableUtils.skipFully(in, length);
239   }
240 
241   /** Returns true iff <code>o</code> is a Text with the same contents.  */
242   public boolean equals(Object o) {
243     if (o instanceof Text)
244       return super.equals(o);
245     return false;
246   }
247 
248   public int hashCode() { return super.hashCode(); }
249 
250   /** A WritableComparator optimized for Text keys. */
251   public static class Comparator extends WritableComparator {
252     public Comparator() {
253       super(Text.class);
254     }
255 
256     public int compare(byte[] b1, int s1, int l1,
257                        byte[] b2, int s2, int l2) {
258       int n1 = WritableUtils.decodeVIntSize(b1[s1]);
259       int n2 = WritableUtils.decodeVIntSize(b2[s2]);
260       return compareBytes(b1, s1+n1, l1-n1, b2, s2+n2, l2-n2);
261     }
262   }
263 
264   static {
265     // register this comparator
266     WritableComparator.define(Text.class, new Comparator());
267   }
268 
269   /// STATIC UTILITIES FROM HERE DOWN
270   /**
271    * Converts the provided byte array to a String using the
272    * UTF-8 encoding. If the input is malformed,
273    * replace by a default value.
274    */
275   public static String decode(byte[] utf8) throws CharacterCodingException {
276     return decode(ByteBuffer.wrap(utf8), true);
277   }
278   
279   public static String decode(byte[] utf8, int start, int length) 
280     throws CharacterCodingException {
281     return decode(ByteBuffer.wrap(utf8, start, length), true);
282   }
283   
284   /** Overloads that convert a UTF-8 encoded byte array to a String */
285   /**
286    * Converts the provided byte array to a String using the
287    * UTF-8 encoding. If <code>replace</code> is true, then
288    * malformed input is replaced with the
289    * substitution character, which is U+FFFD. Otherwise the
290    * method throws a MalformedInputException.
291    */
292   public static String decode(byte[] utf8, int start, int length, boolean replace) 
293     throws CharacterCodingException {
294     return decode(ByteBuffer.wrap(utf8, start, length), replace);
295   }
296   
297   private static String decode(ByteBuffer utf8, boolean replace) 
298     throws CharacterCodingException {
299     CharsetDecoder decoder = DECODER_FACTORY.get();
300     if (replace) {
301       decoder.onMalformedInput(
302           java.nio.charset.CodingErrorAction.REPLACE);
303       decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
304     }
305     String str = decoder.decode(utf8).toString();
306     // set decoder back to its default value: REPORT
307     if (replace) {
308       decoder.onMalformedInput(CodingErrorAction.REPORT);
309       decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
310     }
311     return str;
312   }
313   
314   /** Overloads that encode a String as UTF-8 into a byte buffer (byte array) */
315   /**
316    * Converts the provided String to bytes using the
317    * UTF-8 encoding. If the input is malformed,
318    * invalid chars are replaced by a default value.
319    * @return ByteBuffer: bytes stores at ByteBuffer.array() 
320    *                     and length is ByteBuffer.limit()
321    */
322 
323   public static ByteBuffer encode(String string)
324     throws CharacterCodingException {
325     return encode(string, true);
326   }
327 
328   /**
329    * Converts the provided String to bytes using the
330    * UTF-8 encoding. If <code>replace</code> is true, then
331    * malformed input is replaced with the
332    * substitution character, which is U+FFFD. Otherwise the
333    * method throws a MalformedInputException.
334    * @return ByteBuffer: bytes stores at ByteBuffer.array() 
335    *                     and length is ByteBuffer.limit()
336    */
337   public static ByteBuffer encode(String string, boolean replace)
338     throws CharacterCodingException {
339     CharsetEncoder encoder = ENCODER_FACTORY.get();
340     if (replace) {
341       encoder.onMalformedInput(CodingErrorAction.REPLACE);
342       encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
343     }
344     ByteBuffer bytes = encoder.encode(CharBuffer.wrap(string.toCharArray()));
345     if (replace) {
346       encoder.onMalformedInput(CodingErrorAction.REPORT);
347       encoder.onUnmappableCharacter(CodingErrorAction.REPORT);
348     }
349     return bytes;
350   }
351 
352   /** Read a UTF8 encoded string from in
353    */
354   public static String readString(DataInput in) throws IOException {
355     int length = WritableUtils.readVInt(in);
356     byte [] bytes = new byte[length];
357     in.readFully(bytes, 0, length);
358     return decode(bytes);
359   }
360 
361   /** Write a UTF8 encoded string to out
362    */
363   public static int writeString(DataOutput out, String s) throws IOException {
364     ByteBuffer bytes = encode(s);
365     int length = bytes.limit();
366     WritableUtils.writeVInt(out, length);
367     out.write(bytes.array(), 0, length);
368     return length;
369   }
370 
371   // states for validateUTF8
372   
373   private static final int LEAD_BYTE = 0;
374 
375   private static final int TRAIL_BYTE_1 = 1;
376 
377   private static final int TRAIL_BYTE = 2;
378 
379   /** 
380    * Check if a byte array contains valid utf-8
381    * @param utf8 byte array
382    * @throws MalformedInputException if the byte array contains invalid utf-8
383    */
384   public static void validateUTF8(byte[] utf8) throws MalformedInputException {
385     validateUTF8(utf8, 0, utf8.length);     
386   }
387   
388   /**
389    * Check to see if a byte array is valid utf-8
390    * @param utf8 the array of bytes
391    * @param start the offset of the first byte in the array
392    * @param len the length of the byte sequence
393    * @throws MalformedInputException if the byte array contains invalid bytes
394    */
395   public static void validateUTF8(byte[] utf8, int start, int len)
396     throws MalformedInputException {
397     int count = start;
398     int leadByte = 0;
399     int length = 0;
400     int state = LEAD_BYTE;
401     while (count < start+len) {
402       int aByte = ((int) utf8[count] & 0xFF);
403 
404       switch (state) {
405       case LEAD_BYTE:
406         leadByte = aByte;
407         length = bytesFromUTF8[aByte];
408 
409         switch (length) {
410         case 0: // check for ASCII
411           if (leadByte > 0x7F)
412             throw new MalformedInputException(count);
413           break;
414         case 1:
415           if (leadByte < 0xC2 || leadByte > 0xDF)
416             throw new MalformedInputException(count);
417           state = TRAIL_BYTE_1;
418           break;
419         case 2:
420           if (leadByte < 0xE0 || leadByte > 0xEF)
421             throw new MalformedInputException(count);
422           state = TRAIL_BYTE_1;
423           break;
424         case 3:
425           if (leadByte < 0xF0 || leadByte > 0xF4)
426             throw new MalformedInputException(count);
427           state = TRAIL_BYTE_1;
428           break;
429         default:
430           // too long! Longest valid UTF-8 is 4 bytes (lead + three)
431           // or if < 0 we got a trail byte in the lead byte position
432           throw new MalformedInputException(count);
433         } // switch (length)
434         break;
435 
436       case TRAIL_BYTE_1:
437         if (leadByte == 0xF0 && aByte < 0x90)
438           throw new MalformedInputException(count);
439         if (leadByte == 0xF4 && aByte > 0x8F)
440           throw new MalformedInputException(count);
441         if (leadByte == 0xE0 && aByte < 0xA0)
442           throw new MalformedInputException(count);
443         if (leadByte == 0xED && aByte > 0x9F)
444           throw new MalformedInputException(count);
445         // falls through to regular trail-byte test!!
446       case TRAIL_BYTE:
447         if (aByte < 0x80 || aByte > 0xBF)
448           throw new MalformedInputException(count);
449         if (--length == 0) {
450           state = LEAD_BYTE;
451         } else {
452           state = TRAIL_BYTE;
453         }
454         break;
455       } // switch (state)
456       count++;
457     }
458   }
459 
460   /**
461    * Magic numbers for UTF-8. These are the number of bytes
462    * that <em>follow</em> a given lead byte. Trailing bytes
463    * have the value -1. The values 4 and 5 are presented in
464    * this table, even though valid UTF-8 cannot include the
465    * five and six byte sequences.
466    */
467   static final int[] bytesFromUTF8 =
468   { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
469     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
470     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
471     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
472     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
473     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
474     0, 0, 0, 0, 0, 0, 0,
475     // trail bytes
476     -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
477     -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
478     -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
479     -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1,
480     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
481     1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,
482     3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5 };
483 
484   /**
485    * Returns the next code point at the current position in
486    * the buffer. The buffer's position will be incremented.
487    * Any mark set on this buffer will be changed by this method!
488    */
489   public static int bytesToCodePoint(ByteBuffer bytes) {
490     bytes.mark();
491     byte b = bytes.get();
492     bytes.reset();
493     int extraBytesToRead = bytesFromUTF8[(b & 0xFF)];
494     if (extraBytesToRead < 0) return -1; // trailing byte!
495     int ch = 0;
496 
497     switch (extraBytesToRead) {
498     case 5: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */
499     case 4: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */
500     case 3: ch += (bytes.get() & 0xFF); ch <<= 6;
501     case 2: ch += (bytes.get() & 0xFF); ch <<= 6;
502     case 1: ch += (bytes.get() & 0xFF); ch <<= 6;
503     case 0: ch += (bytes.get() & 0xFF);
504     }
505     ch -= offsetsFromUTF8[extraBytesToRead];
506 
507     return ch;
508   }
509 
510   
511   static final int offsetsFromUTF8[] =
512   { 0x00000000, 0x00003080,
513     0x000E2080, 0x03C82080, 0xFA082080, 0x82082080 };
514 
515   /**
516    * For the given string, returns the number of UTF-8 bytes
517    * required to encode the string.
518    * @param string text to encode
519    * @return number of UTF-8 bytes required to encode
520    */
521   public static int utf8Length(String string) {
522     CharacterIterator iter = new StringCharacterIterator(string);
523     char ch = iter.first();
524     int size = 0;
525     while (ch != CharacterIterator.DONE) {
526       if ((ch >= 0xD800) && (ch < 0xDC00)) {
527         // surrogate pair?
528         char trail = iter.next();
529         if ((trail > 0xDBFF) && (trail < 0xE000)) {
530           // valid pair
531           size += 4;
532         } else {
533           // invalid pair
534           size += 3;
535           iter.previous(); // rewind one
536         }
537       } else if (ch < 0x80) {
538         size++;
539       } else if (ch < 0x800) {
540         size += 2;
541       } else {
542         // ch < 0x10000, that is, the largest char value
543         size += 3;
544       }
545       ch = iter.next();
546     }
547     return size;
548   }
549 }
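
The listing above serializes a Text as a zero-compressed length followed by the UTF-8 bytes, and its Comparator skips that length prefix before comparing raw bytes. The following is a minimal plain-JDK sketch of that layout, not the Hadoop code itself: the class VIntSketch and its method names are invented for illustration, and the encoding logic is a re-implementation of what WritableUtils.writeVLong and decodeVIntSize are understood to do, so it should be checked against the Hadoop source before being relied on.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; it only mirrors the wire format described above.
final class VIntSketch {

    /** One byte for values in -112..127, otherwise a marker byte plus big-endian magnitude bytes. */
    static byte[] encodeVLong(long i) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (i >= -112 && i <= 127) {
            out.write((int) i);
            return out.toByteArray();
        }
        int len = -112;
        if (i < 0) {
            i ^= -1L;          // store the one's complement of negative values
            len = -120;
        }
        for (long tmp = i; tmp != 0; tmp >>= 8) {
            len--;             // one step per magnitude byte
        }
        out.write(len);        // marker byte encodes sign and byte count
        int dataBytes = (len < -120) ? -(len + 120) : -(len + 112);
        for (int idx = dataBytes; idx != 0; idx--) {
            int shift = (idx - 1) * 8;
            out.write((int) ((i >> shift) & 0xFF));
        }
        return out.toByteArray();
    }

    /** Total size of the encoded integer, derived from its first byte only. */
    static int vintSize(byte first) {
        if (first >= -112) return 1;
        if (first < -120) return -119 - first;
        return -111 - first;
    }

    /** Length prefix followed by the UTF-8 bytes, as in Text.write. */
    static byte[] serializeText(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] prefix = encodeVLong(utf8.length);
        byte[] record = new byte[prefix.length + utf8.length];
        System.arraycopy(prefix, 0, record, 0, prefix.length);
        System.arraycopy(utf8, 0, record, prefix.length, utf8.length);
        return record;
    }

    /** Compares two serialized records without deserializing: skip the prefix, then compare unsigned bytes. */
    static int compareRaw(byte[] r1, byte[] r2) {
        int s1 = vintSize(r1[0]), s2 = vintSize(r2[0]);
        int l1 = r1.length - s1, l2 = r2.length - s2;
        for (int k = 0; k < Math.min(l1, l2); k++) {
            int a = r1[s1 + k] & 0xFF, b = r2[s2 + k] & 0xFF;
            if (a != b) return a - b;
        }
        return l1 - l2;
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b & 0xFF));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(serializeText("hadoop")));  // 066861646f6f70
        System.out.println(hex(encodeVLong(200)));          // 8fc8
        System.out.println(compareRaw(serializeText("hadoop"), serializeText("hive")) < 0); // true
    }
}
```

Short strings cost a single prefix byte, which is why the serialized form of "hadoop" is only seven bytes long.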


The same character takes different forms in Unicode, UTF-8, and Java:
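
One way to see the three forms side by side is a small plain-JDK program. This is only an illustrative sketch: the class CodePointForms and its methods utf8Hex and javaEscapes are made-up names, and the four code points are the ones used in the Text test demo.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; uses only the JDK.
final class CodePointForms {

    /** Space-separated hex of the UTF-8 bytes of s. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    /** The UTF-16 code units of s, each written as a Java unicode escape. */
    static String javaEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] codePoints = {0x0041, 0x00DF, 0x6771, 0x10400};
        for (int cp : codePoints) {
            String s = new String(Character.toChars(cp));
            // One row per character: code point, UTF-8 bytes, Java representation
            System.out.printf("U+%04X  utf8=[%s]  java=%s%n", cp, utf8Hex(s), javaEscapes(s));
        }
    }
}
```

The four characters need 1, 2, 3, and 4 UTF-8 bytes respectively, while in Java only the last one needs two char values (a surrogate pair). That is why the string in the demo has length 5 but the Text has length 10.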

(1) Text test demo

package com.iresearch.hadoop.io;

import static org.hamcrest.CoreMatchers.is;
//import static org.hamcrest.Matchers.greaterThan;
//import static org.hamcrest.Matchers.lessThan;
import static org.junit.Assert.assertThat;

import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;

import org.apache.hadoop.io.Text;
import org.junit.Test;

import com.iresearch.hadoop.io.base.WritableTestBase;

public class TextTest extends WritableTestBase {
    
    @Test
    public void test(){
        
        Text text =new Text("hadoop");
        
        //getLength(), getBytes().length
        assertThat( text.getLength(), is(6) );
        assertThat( text.getBytes().length, is(6) );
        
        //charAt()
        System.out.println(text.charAt(0)); 
        assertThat( text.charAt(0), is((int) 'h') );
        assertThat("Out of bounds..", text.charAt(100), is(-1) );
        
        //find()
        assertThat("find a substring from Text", text.find("do"), is(2));
        assertThat("find first 'o' ", text.find("o"), is(3));
        assertThat("find 'p' from position 5 ", text.find("p",5), is(5));
        assertThat("No match", text.find("hive"), is(-1));
    }
    
    // part 1: how indexing differs between Text and java.lang.String
    @Test
    public void testIndex() throws UnsupportedEncodingException{
        Text text =new Text("\u0041\u00DF\u6771\uD801\uDC00");
        String str = new String("\u0041\u00DF\u6771\uD801\uDC00");
        
        // ^^ step 1: Test for String
        assertThat(str.length(), is(5));
        assertThat(str.getBytes("UTF-8").length, is(10));
          //indexOf(String str)
        assertThat(str.indexOf("\u0041"), is(0));
        assertThat(str.indexOf("\u00DF"), is(1));
        assertThat(str.indexOf("\u6771"), is(2));
        assertThat(str.indexOf("\uD801"), is(3));
        assertThat(str.indexOf("\uDC00"), is(4));
          //charAt(int index)
        assertThat(str.charAt(0), is('\u0041'));
        assertThat(str.charAt(1), is('\u00DF'));
        assertThat(str.charAt(2), is('\u6771'));
        assertThat(str.charAt(3), is('\uD801'));
        assertThat(str.charAt(4), is('\uDC00'));        
          //codePointAt(int index)
        assertThat(str.codePointAt(0), is(0x0041));
        assertThat(str.codePointAt(1), is(0x00DF));
        assertThat(str.codePointAt(2), is(0x6771));
        assertThat(str.codePointAt(3), is(0x10400)); // the surrogate pair D801 DC00 forms a single Unicode code point
        // vv step 1: Test for String
        
        // ^^ step 2: Test for Text
        assertThat(text.getLength(), is(10));
          //find(String what) returns the byte offset of the character's UTF-8 encoding within the Text object
        assertThat(text.find("\u0041"), is(0));
        assertThat(text.find("\u00DF"), is(1));
        assertThat(text.find("\u6771"), is(3));
        assertThat(text.find("\uD801"), is(-1)); // in the UTF-8 bytes of the Text, the pair D801 DC00 is one supplementary character
        assertThat(text.find("\uDC00"), is(-1));
        assertThat(text.find("\uD801\uDC00"), is(6));
        
          //charAt(int position)
        assertThat(text.charAt(0), is(0x0041));  // similar to java.lang.String.codePointAt(int index)
        assertThat(text.charAt(1), is(0x00DF));
        assertThat(text.charAt(2), is(-1));
        assertThat(text.charAt(3), is(0x6771));
        assertThat(text.charAt(6), is(0x10400));
        
        System.out.println(text.getLength());       //10
        System.out.println(text.getBytes().length); //11
        /*
         * [position,limit,capacity] ===> [0, 10, 11]
         * 
         *      public Text(String string) {
         *        set(string);
         *      }
         *      public void set(String string) {
         *        try {
         *          ByteBuffer bb = encode(string, true);
         *          bytes = bb.array();
         *          length = bb.limit();
         *        }catch(CharacterCodingException e) {
         *          throw new RuntimeException("Should not have happened " + e.toString()); 
         *        }
         *      }
         */
        System.out.println(str.length());           //5
        // vv step 2: Test for Text
    }
    
    // part 2: iterating over a Text object
    @Test
    public void testForEachText(){
        Text text =new Text("\u0041\u00DF\u6771\uD801\uDC00");
        
        ByteBuffer buffer = ByteBuffer.wrap(text.getBytes(), 0, text.getLength());
        int mark;
        while( buffer.hasRemaining() && (mark=Text.bytesToCodePoint(buffer))!= -1 ) {
            System.out.println(Integer.toHexString(mark));
        }
        //41
        //df
        //6771
        //10400
    }
    
    // part 3: mutability of Text
    @Test
    public void testMutability(){
        
        Text text = new Text("hadoop");            // ==> [104, 97, 100, 111, 111, 112]
        
        /* Text is mutable, like every Writable implementation except NullWritable */
        //text.set("hive");
        //System.out.println(text.getLength());       ==>4
        //System.out.println(text.getBytes().length); ==>4
        
        /* the array returned by getBytes() may be longer than getLength() */
        text.set(new Text("hive"));                // ==> [104, 105, 118, 101]
        System.out.println(text.getLength());      // ==>4
        System.out.println(text.getBytes().length);// ==>6 array length unchanged, bytes=[104, 105, 118, 101, 111, 112]
        
        /*
          public void set(String string) {
            try {
              ByteBuffer bb = encode(string, true);
              bytes = bb.array();
              length = bb.limit();
            }catch(CharacterCodingException e) {
              throw new RuntimeException("Should not have happened " + e.toString()); 
            }
          }

          public void set(Text other) {
            set(other.getBytes(), 0, other.getLength());
          }
        
          public void set(byte[] utf8, int start, int len) {
            setCapacity(len, false); 
              // copies the 4 bytes of utf8 [104, 105, 118, 101] over bytes [104, 97, 100, 111, 111, 112], giving [104, 105, 118, 101, 111, 112]
            System.arraycopy(utf8, start, bytes, 0, len);  
            this.length = len;  // now this.length = len = 4 
          } 
        */
    }
}


Text does not offer the rich string-manipulation API of java.lang.String, so in most cases a Text object is first converted to a String, usually by calling its toString() method:

assertThat(new Text("hadoop").toString(), is("hadoop"));

// source of toString()
public String toString() {
  try {
    return decode(bytes, 0, length);
  } catch (CharacterCodingException e) { 
    throw new RuntimeException("Should not have happened " + e.toString()); 
  }
}
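
toString() goes through decode, which in the Text listing either replaces malformed input with U+FFFD or reports it with an exception, depending on the replace flag. The difference between the two modes can be tried with the JDK alone. The sketch below is not the Hadoop implementation: DecodeModes and its methods are hypothetical names, and it creates a fresh decoder on every call, whereas the Text listing obtains its decoder from DECODER_FACTORY and resets it to REPORT afterwards.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; uses only the JDK.
final class DecodeModes {

    /** Decodes UTF-8; malformed input becomes U+FFFD if replace is true, otherwise an exception is thrown. */
    static String decode(byte[] utf8, boolean replace) throws CharacterCodingException {
        CodingErrorAction action = replace ? CodingErrorAction.REPLACE : CodingErrorAction.REPORT;
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .decode(ByteBuffer.wrap(utf8))
                .toString();
    }

    /** Lenient variant: never fails on bad bytes. */
    static String decodeLenient(byte[] utf8) {
        try {
            return decode(utf8, true);
        } catch (CharacterCodingException e) {
            throw new IllegalStateException("not expected in REPLACE mode", e);
        }
    }

    /** Strict variant used as a validity check. */
    static boolean isValidUtf8(byte[] utf8) {
        try {
            decode(utf8, false);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bad = {0x68, 0x69, (byte) 0xFF};   // 0xFF never occurs in valid UTF-8
        System.out.println(decodeLenient(bad).length());  // 3: 'h', 'i', and one replacement character
        System.out.println(isValidUtf8(bad));             // false
    }
}
```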

2.4 The BytesWritable type

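BytesWritable wraps an array of binary data. Its serialized form is a 4-byte integer field that gives the number of data bytes that follow, and then the data bytes themselves. For example, a byte array of length 2 holding the values 3 and 5 is serialized as a 4-byte integer (0x00000002) followed by the two bytes of the array (03 and 05). A test demo follows: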

package com.iresearch.hadoop.io;
import static org.hamcrest.CoreMatchers.is;
//import static org.hamcrest.Matchers.greaterThan;
//import static org.hamcrest.Matchers.lessThan;
import static org.junit.Assert.assertThat;

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.BytesWritable;
import org.junit.Test;

import com.iresearch.hadoop.io.base.WritableTestBase;

public class BytesWritableTest extends WritableTestBase {
    
    @Test
    public void test() throws IOException{
        
        // inspect the serialized form of a BytesWritable
        BytesWritable bytesWritable = new BytesWritable(new byte[]{3, 5});
        System.out.println(Arrays.toString( serialize(bytesWritable) ));     //[0, 0, 0, 2, 3, 5]
        assertThat(serializeToHexString(bytesWritable), is("000000020305"));
        
        // difference between getLength() and getBytes().length
        bytesWritable.setCapacity(10);                                           
        assertThat( bytesWritable.getLength(), is(2) );                      //2
        System.out.println(Arrays.toString( serialize(bytesWritable) ));     //[0, 0, 0, 2, 3, 5]
        assertThat( bytesWritable.getBytes().length, is(10) );               //10
        
        bytesWritable.setCapacity(1);
        assertThat( bytesWritable.getLength(), is(1) );                      //1
        System.out.println(Arrays.toString( serialize(bytesWritable) ));     //[0, 0, 0, 1, 3]
        assertThat( bytesWritable.getBytes().length, is(1) );                 //1
        /*
          public BytesWritable(byte[] bytes) {
            this.bytes = bytes;
            this.size = bytes.length;
          }        
          public void setCapacity(int new_cap) {
            if (new_cap != getCapacity()) {
              byte[] new_data = new byte[new_cap];
              if (new_cap < size) {
                size = new_cap;
              }
              if (size != 0) {
                System.arraycopy(bytes, 0, new_data, 0, size);
              }
              bytes = new_data;
            }
          }
        */
    }
}
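
The same wire format can be reproduced without Hadoop on the classpath. The following is a minimal sketch under the assumption that the format is exactly a 4-byte big-endian length plus the valid bytes, as described above. LengthPrefixedBytes is a hypothetical class, not part of Hadoop.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the BytesWritable wire format; uses only the JDK.
final class LengthPrefixedBytes {

    /** Writes a 4-byte big-endian length, then the first size bytes of buf. */
    static byte[] serialize(byte[] buf, int size) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bos)) {
            out.writeInt(size);
            out.write(buf, 0, size);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b & 0xFF));
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] backing = new byte[10];   // capacity 10, like after setCapacity(10)
        backing[0] = 3;
        backing[1] = 5;
        // Only the 2 valid bytes are written, regardless of the array's capacity
        System.out.println(hex(serialize(backing, 2)));   // 000000020305
    }
}
```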


2.5 The NullWritable type

NullWritable is a singleton and therefore immutable. Its serialized length is 0: no bytes are written to the stream and none are read from it. It is typically used as a placeholder.

package org.apache.hadoop.io;

import java.io.*;

/** Singleton Writable with no data. */
public class NullWritable implements WritableComparable {

  private static final NullWritable THIS = new NullWritable();
  
  /** The constructor is private: NullWritable is a singleton and cannot be modified. */
  // no public ctor
  private NullWritable() {}                       

  /** Returns the single instance of this class. */
  public static NullWritable get() { return THIS; }
  
  public String toString() {
    return "(null)";
  }

  public int hashCode() { return 0; }
  public int compareTo(Object other) {
    if (!(other instanceof NullWritable)) {
      throw new ClassCastException("can't compare " + other.getClass().getName() + " to NullWritable");
    }
    return 0;
  }
  public boolean equals(Object other) { return other instanceof NullWritable; }
  public void readFields(DataInput in) throws IOException {}
  public void write(DataOutput out) throws IOException {}

  /** A Comparator "optimized" for NullWritable. */
  public static class Comparator extends WritableComparator {
    public Comparator() {
      super(NullWritable.class);
    }

    /** Compare the buffers in serialized form. */
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      assert 0 == l1;
      assert 0 == l2;
      return 0;
    }
  }
  
  // register this comparator
  static {                                        
    WritableComparator.define(NullWritable.class, new Comparator());
  }
}


2.6 ObjectWritable and GenericWritable

ObjectWritable is a general-purpose wrapper for the following: Java primitives, String, enum, Writable, null, or arrays of any of these types. It is used in Hadoop RPC to marshal and unmarshal method arguments and return types; its main use is to serialize objects made up of more than one field. Deserialization on the receiving side relies on WritableFactory and WritableFactories, which create an object from its class name.

(1) ObjectWritable test demo

package com.iresearch.hadoop.io;
import java.io.IOException;

import org.apache.hadoop.io.ObjectWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.StringUtils;

import com.iresearch.hadoop.io.base.WritableTestBase;

public class ObjectWritableTest extends WritableTestBase {

    public static void main(String[] args) throws IOException {
        
        Text text = new Text("\u0041");
        ObjectWritable writable = new ObjectWritable(text);
        System.out.println( StringUtils.byteToHexString( serialize(writable)) );
        //00196f72672e6170616368652e6861646f6f702e696f2e5465787400196f72672e6170616368652e6861646f6f702e696f2e546578740141
        //(a)0019 6f72672e6170616368652e6861646f6f702e696f2e54657874, (b)0019 6f72672e6170616368652e6861646f6f702e696f2e54657874, (c)0141
        /*
        
           For the serialization logic, see ObjectWritable.writeObject(DataOutput out, Object instance, Class declaredClass, Configuration conf)
        
          (1) Serialize the declared class of the ObjectWritable
             UTF8.writeString(out, declaredClass.getName());  ==>
             
             0019 6f72672e6170616368652e6861646f6f702e696f2e54657874 (the leading short is the length of the class name string; org.apache.hadoop.io.Text has 25 characters = 0x0019)
          (2) Serialize the instance, which implements Writable
             if (Writable.class.isAssignableFrom(declaredClass)) { // implements Writable
                UTF8.writeString(out, instance.getClass().getName());
                ((Writable)instance).write(out);
             }                                                ==>
             
             0019 6f72672e6170616368652e6861646f6f702e696f2e54657874
             0141 (the variable-length Text serialization: length 0x01, content 0x41)
         */
        
        ObjectWritable srcWritable = new ObjectWritable(Integer.TYPE, 188);
        ObjectWritable destWritable = new ObjectWritable();
        cloneInto(srcWritable, destWritable);
        System.out.println( serializeToHexString(srcWritable) ); //0003696e74000000bc
        System.out.println((Integer)destWritable.get());         //188
    }
}


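As the output of the demo above shows, ObjectWritable is a general mechanism that writes the class name of the wrapped type on every serialization, which wastes a lot of space. GenericWritable addresses this: when the set of wrapped types is small and known in advance, the types can be kept in a static array, and the serialized form stores only the index of the type in that array, which improves performance. The supported types are specified in a subclass, as in the following example: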

(2) GenericWritable test demo

package com.iresearch.hadoop.io;

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.GenericWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.StringUtils;

import com.iresearch.hadoop.io.base.WritableTestBase;

public class GenericWritableTest extends WritableTestBase{

    public static void main(String[] args) throws IOException {
        Text text = new Text("hadoop");
        MyGenericWritable writable = new MyGenericWritable(text);

        System.out.println(StringUtils.byteToHexString( serialize(text) ));     //066861646f6f70
        System.out.println(StringUtils.byteToHexString( serialize(writable) )); //00066861646f6f70 ==> 00, 066861646f6f70 (org.apache.hadoop.io.Text is at index 0 of classes)

        writable.set(new BytesWritable(new byte[]{3,5}));
        System.out.println(serializeToHexString(writable));                     //01000000020305   ==> 01, 000000020305 (org.apache.hadoop.io.BytesWritable is at index 1 of classes)
        System.out.println( ((BytesWritable)writable.get()).toString() );       //03 05
        System.out.println( writable.toString() );                              //GW[class=org.apache.hadoop.io.BytesWritable,value=03 05]

        /*
          // GenericWritable.set(Writable obj) resets the instance and type fields
          public void set(Writable obj) {
            instance = obj;
            Class<? extends Writable> instanceClazz = instance.getClass();
            Class<? extends Writable>[] clazzes = getTypes();
            for (int i = 0; i < clazzes.length; i++) {
              Class<? extends Writable> clazz = clazzes[i];
              if (clazz.equals(instanceClazz)) {
                type = (byte) i;
                return;
              }
            }
            throw new RuntimeException("The type of instance is: " + instance.getClass() + ", which is NOT registered.");
          }

          // GenericWritable serialization method
          public void write(DataOutput out) throws IOException {
            if (type == NOT_SET || instance == null)
              throw new IOException("The GenericWritable has NOT been set correctly. type=" + type + ", instance=" + instance);
            out.writeByte(type); // type is the index of the wrapped object's class in MyGenericWritable.classes
            instance.write(out);
          }

        */
    }
}


@SuppressWarnings("unchecked")
class MyGenericWritable extends GenericWritable {

    public MyGenericWritable(Writable writable){
        set(writable);
    }

    // The supported types; the array index is the type byte written during serialization
    public static Class<? extends Writable>[] classes = null;

    static {
        classes = (Class<? extends Writable>[])new Class[]{
            Text.class, BytesWritable.class
        };
    }

    @Override
    protected Class<? extends Writable>[] getTypes() {
        return classes;
    }

}
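The space saving can be seen without Hadoop on the classpath. The following is a minimal JDK-only sketch of the idea, not GenericWritable's or ObjectWritable's actual wire format: the class TypeTagSketch, its TYPES array, and the use of String and writeUTF as stand-ins for Writable types are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class TypeTagSketch {

    // Hypothetical registry of supported types; the array index is the type tag.
    static final Class<?>[] TYPES = { String.class, byte[].class };

    // Finds the tag for a class, failing for unregistered classes
    // (the same idea as the lookup loop in GenericWritable.set()).
    static int indexOf(Class<?> clazz) {
        for (int i = 0; i < TYPES.length; i++) {
            if (TYPES[i].equals(clazz)) {
                return i;
            }
        }
        throw new IllegalArgumentException(clazz + " is not registered");
    }

    // Variant 1: prefix the payload with the full class name.
    static byte[] withClassName(String value) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(value.getClass().getName());
            out.writeUTF(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Variant 2: prefix the payload with a one-byte index into TYPES.
    static byte[] withIndex(String value) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeByte(indexOf(value.getClass()));
            out.writeUTF(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(withClassName("hadoop").length); // 26
        System.out.println(withIndex("hadoop").length);     // 9
    }
}
```

The payload is identical in both variants; only the type prefix shrinks, from an 18-byte name to a single byte. The price is that reader and writer must agree on the order of the type array.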


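3 Collection data types

The org.apache.hadoop.io package contains six Writable collection classes: ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable and EnumSetWritable.

3.1 ArrayWritable and TwoDArrayWritable

ArrayWritable and TwoDArrayWritable are Writable implementations for arrays and two-dimensional arrays of Writable instances. All elements of an ArrayWritable or TwoDArrayWritable must be instances of the same class, which is specified in the constructor, as shown below: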

ArrayWritable arrayWritable = new ArrayWritable(Text.class);

ArrayWritable and TwoDArrayWritable both provide set(), get() and toArray() methods. The test demo follows:

package com.iresearch.hadoop.io;

import static org.hamcrest.CoreMatchers.is;
import static org.junit.Assert.assertThat;

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Text;
import org.junit.Test;

import com.iresearch.hadoop.io.base.WritableTestBase;

public class ArrayWritableTest extends WritableTestBase {

    @Test
    public void testArrayWritable() throws IOException{

        ArrayWritable arrayWritable = new ArrayWritable(Text.class);
        arrayWritable.set(new Text[]{new Text("hadoop"), new Text("hive")});

        // The array length is written first as an int, followed by each serialized element:
        // 0 0 0 2 | 6 104 97 100 111 111 112 | 4 104 105 118 101
        System.out.println( Arrays.toString(serialize(arrayWritable)) ); //[0, 0, 0, 2, 6, 104, 97, 100, 111, 111, 112, 4, 104, 105, 118, 101]

        /*
          // ArrayWritable source: deserialization and serialization

          public void readFields(DataInput in) throws IOException {
            values = new Writable[in.readInt()];          // construct values
            for (int i = 0; i < values.length; i++) {
              Writable value = WritableFactories.newInstance(valueClass);
              value.readFields(in);                       // read a value
              values[i] = value;                          // store it in values
            }
          }

          public void write(DataOutput out) throws IOException {
            out.writeInt(values.length);                 // write values
            for (int i = 0; i < values.length; i++) {
              values[i].write(out);
            }
          }

        */

        MyArrayWritable myWritable = new MyArrayWritable();
        cloneInto(arrayWritable, myWritable);
        assertThat(myWritable.get().length, is(2));
        assertThat((Text)myWritable.get()[0], is(new Text("hadoop")));

        // Test ArrayWritable's toArray() method
        Text[] textArray = (Text[])myWritable.toArray();
        System.out.println(textArray[1].toString());  //hive
    }
}

// The element class is not part of the serialized form, so deserialization needs
// a subclass whose no-argument constructor fixes the element type.
class MyArrayWritable extends ArrayWritable{

    public MyArrayWritable() {
        super(Text.class);
    }

}
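The length-prefixed layout printed by the test can be reproduced with plain JDK streams. This is a sketch under stated assumptions: LengthPrefixedArray is a hypothetical class, String stands in for Text, and each string is written with a single length byte, which matches what the demo output shows only for short values (up to 127 bytes).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LengthPrefixedArray {

    // Writes the element count as a 4-byte int, then each element in turn.
    static byte[] write(String[] values) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(values.length);
            for (String value : values) {
                byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
                if (utf8.length > 127) {
                    throw new IllegalArgumentException("this sketch only handles short strings");
                }
                out.writeByte(utf8.length); // one-byte length prefix (assumption: length <= 127)
                out.write(utf8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Reads the count first, allocates the array, then reads each element.
    static String[] read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String[] values = new String[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                byte[] utf8 = new byte[in.readByte()];
                in.readFully(utf8);
                values[i] = new String(utf8, StandardCharsets.UTF_8);
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = write(new String[] {"hadoop", "hive"});
        System.out.println(Arrays.toString(bytes));
        // [0, 0, 0, 2, 6, 104, 97, 100, 111, 111, 112, 4, 104, 105, 118, 101]
        System.out.println(Arrays.toString(read(bytes))); // [hadoop, hive]
    }
}
```

Nothing in the byte stream says what the elements are, so the reader has to know the element type in advance. That is why the demo deserializes into MyArrayWritable rather than a bare ArrayWritable.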


3.2 MapWritable

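MapWritable and SortedMapWritable implement java.util.Map<Writable, Writable> and java.util.SortedMap<WritableComparable, Writable> respectively. The type of each key and value field is part of the serialization format for that field. The type is stored as a single byte that acts as an index into an array of types. The array is populated with the standard types in the org.apache.hadoop.io package, and custom Writable types are accommodated by writing a header that encodes the type array for the non-standard types. As implemented, MapWritable and SortedMapWritable use positive byte values (1 to 127) for custom types, so at most 127 distinct non-standard Writable classes can be used in any one MapWritable or SortedMapWritable instance. The test demo follows: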

package com.iresearch.hadoop.io;

import static org.hamcrest.CoreMatchers.is;
import static org.junit.Assert.assertThat;

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.VIntWritable;
import org.junit.Test;

import com.iresearch.hadoop.io.base.WritableTestBase;

public class MapWritableTest extends WritableTestBase {

    // Test keys in MapWritable
    @Test
    public void testKeyInMapWritable() throws IOException{
        MapWritable mapWritable = new MapWritable();
        mapWritable.put(new IntWritable(1), new Text("hadoop"));
        mapWritable.put(new VIntWritable(2), new BytesWritable(new byte[]{3,5}));

        MapWritable destWritable = new MapWritable();
        cloneInto(mapWritable, destWritable);
        assertThat((Text)destWritable.get(new IntWritable(1)), is(new Text("hadoop")));

        /*
          assertThat( ((BytesWritable)destWritable.get(new IntWritable(2))).getLength(), is(2));
          ==> fails with java.lang.NullPointerException: the entry was stored under VIntWritable(2),
              and IntWritable(2) is not equal to it, so get() returns null. A lookup key must have
              the same class as the stored key, not just the same numeric value.

          Source of MapWritable.put():

            public Writable put(Writable key, Writable value) {
               addToMap(key.getClass());
               addToMap(value.getClass());
               return instance.put(key, value);
            }

          addToMap(Class clazz) is a method of the parent class AbstractMapWritable:

            protected synchronized void addToMap(Class clazz) {
               if (classToIdMap.containsKey(clazz)) { return; }
               if (newClasses + 1 > Byte.MAX_VALUE) {
                 throw new IndexOutOfBoundsException("adding an additional class would exceed the maximum number allowed");
               }
               byte id = ++newClasses;
               addToMap(clazz, id);
            }

            Map<Class, Byte> classToIdMap = new ConcurrentHashMap<Class, Byte>();
            Map<Byte, Class> idToClassMap = new ConcurrentHashMap<Byte, Class>();
            // MapWritable and SortedMapWritable, which extend AbstractMapWritable,
            // can use at most 127 distinct non-standard Writable classes
            private volatile byte newClasses = 0;

            private synchronized void addToMap(Class clazz, byte id) {
               if (classToIdMap.containsKey(clazz)) {
                  byte b = classToIdMap.get(clazz);
                  if (b != id) {
                     throw new IllegalArgumentException ("Class " + clazz.getName() + " already registered but maps to " + b + " and not " + id);
                  }
               }
               if (idToClassMap.containsKey(id)) {
                  Class c = idToClassMap.get(id);
                  if (!c.equals(clazz)) {
                     throw new IllegalArgumentException("Id " + id + " exists but maps to " + c.getName() + " and not " + clazz.getName());
                  }
               }
               classToIdMap.put(clazz, id);
               idToClassMap.put(id, clazz);
            }

        */
        assertThat( ((BytesWritable)destWritable.get(new VIntWritable(2))).getLength(), is(2));
    }

    // Test MapWritable serialization; the serialization process is shown in Figure 3.2
    @Test
    public void testSerialize() throws IOException{

        MapWritable mapWritable = new MapWritable();
        mapWritable.put(new IntWritable(1), new Text("hadoop"));
        mapWritable.put(new VIntWritable(2), new BytesWritable(new byte[]{3,5}));

        //000000000285000000018c066861646f6f708e0283000000020305
        //00, 00000002, 85, 00000001, 8c, 06 6861646f6f70, 8e, 02, 83, 00000002 0305
        System.out.println(serializeToHexString(mapWritable));
    }
}
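The 127-class limit follows directly from the one-byte counter in the addToMap() source quoted in the test. A JDK-only sketch of that bookkeeping is shown here; ClassIdRegistry and its method names are hypothetical, and the real AbstractMapWritable additionally pre-registers the standard types under negative ids, which this sketch leaves out.

```java
import java.util.HashMap;
import java.util.Map;

final class ClassIdRegistry {

    private final Map<Class<?>, Byte> classToId = new HashMap<>();
    private final Map<Byte, Class<?>> idToClass = new HashMap<>();

    // Number of custom classes registered so far; ids are handed out as 1, 2, 3, ...
    private byte newClasses = 0;

    // Returns the id of a class, assigning the next positive byte on first use.
    byte register(Class<?> clazz) {
        Byte existing = classToId.get(clazz);
        if (existing != null) {
            return existing;
        }
        if (newClasses + 1 > Byte.MAX_VALUE) {
            throw new IndexOutOfBoundsException("no positive byte ids left (maximum is 127)");
        }
        byte id = ++newClasses;
        classToId.put(clazz, id);
        idToClass.put(id, clazz);
        return id;
    }

    // Reverse lookup used when reading: id byte back to class.
    Class<?> lookup(byte id) {
        return idToClass.get(id);
    }

    public static void main(String[] args) {
        ClassIdRegistry registry = new ClassIdRegistry();
        System.out.println(registry.register(String.class));  // 1
        System.out.println(registry.register(Integer.class)); // 2
        System.out.println(registry.register(String.class));  // 1 (already registered)
        System.out.println(registry.lookup((byte) 2));        // class java.lang.Integer
    }
}
```

Because ids are assigned in order of first use, two map instances can give the same custom class different ids. This is why custom classes have to be described in a header written with each serialized map.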


Figure 3.2: source code of the MapWritable serialization process.
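The hex dump printed by testSerialize can also be read back field by field. The sketch below is a hypothetical JDK-only reader (MapWritableDumpReader is not a Hadoop class). It covers only the four type ids that appear in that dump (0x85, 0x8c, 0x8e and 0x83, matched to IntWritable, Text, VIntWritable and BytesWritable from the demo output), and it assumes single-byte lengths and no custom-class header.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MapWritableDumpReader {

    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Reads one type-id byte, then the value that the id announces.
    static String readTagged(DataInputStream in) throws IOException {
        int id = in.readByte() & 0xff;
        switch (id) {
            case 0x85:
                return "IntWritable(" + in.readInt() + ")";
            case 0x8e:
                return "VIntWritable(" + in.readByte() + ")"; // single-byte values only
            case 0x8c: {
                byte[] utf8 = new byte[in.readByte()];        // single-byte length only
                in.readFully(utf8);
                return "Text(" + new String(utf8, StandardCharsets.UTF_8) + ")";
            }
            case 0x83: {
                byte[] data = new byte[in.readInt()];
                in.readFully(data);
                return "BytesWritable(" + Arrays.toString(data) + ")";
            }
            default:
                throw new IllegalArgumentException("type id not covered by this sketch: " + id);
        }
    }

    // Layout: custom-class count (1 byte), entry count (int), then tagged key/value pairs.
    static List<String> decode(String hex) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(fromHex(hex)))) {
            if (in.readByte() != 0) {
                throw new IllegalArgumentException("custom-class header is not handled in this sketch");
            }
            int entries = in.readInt();
            List<String> result = new ArrayList<>();
            for (int i = 0; i < entries; i++) {
                String key = readTagged(in);
                String value = readTagged(in);
                result.add(key + " -> " + value);
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("000000000285000000018c066861646f6f708e0283000000020305"));
        // [IntWritable(1) -> Text(hadoop), VIntWritable(2) -> BytesWritable([3, 5])]
    }
}
```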

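4 Implementing a custom Writable type

Hadoop's built-in Writable implementations cover most needs, but sometimes a new implementation has to be written. A custom Writable gives full control over the binary representation and the sort order. Because Writables are at the core of the MapReduce data path, tuning the binary representation can have a significant effect on performance. A custom Writable type is shown below: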

package com.iresearch.hadoop.io;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Arrays;

import org.apache.commons.lang.ArrayUtils;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

import com.iresearch.hadoop.io.base.WritableTestBase;

public class CustomTextWritable implements WritableComparable<CustomTextWritable> {

    private Text first;
    private Text second;

    public CustomTextWritable(){
        set(new Text(), new Text());
    }

    public CustomTextWritable(Text first, Text second){
        set(first, second);
    }

    public CustomTextWritable(String first, String second){
        set(new Text(first), new Text(second));
    }

    public void set(Text first, Text second) {
        this.first = first;
        this.second = second;
    }

    public Text getFirst(){
        return first;
    }

    public Text getSecond(){
        return second;
    }

    public byte[] getBytes(){
        // Text.getBytes() returns the backing array, which is only valid up to getLength(),
        // so copy just the valid part of each field before concatenating
        return ArrayUtils.addAll(
                Arrays.copyOf(first.getBytes(), first.getLength()),
                Arrays.copyOf(second.getBytes(), second.getLength()));
    }

    public int getLength(){
        return first.getLength() + second.getLength();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    @Override
    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if(obj instanceof CustomTextWritable){
            return first.equals( ((CustomTextWritable)obj).first ) && second.equals( ((CustomTextWritable)obj).second );
        }
        return false;
    }

    @Override
    public String toString() {
        return first.toString() + "\t" +second.toString();
    }

    @Override
    public int compareTo(CustomTextWritable other) {
        int cmp = first.compareTo(other.first);
        if(cmp != 0){
            return cmp;
        }
        return second.compareTo(other.second);
    }

    // the default comparator of CustomTextWritable
    public static class Comparator extends WritableComparator{

        private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

        protected Comparator() {
            super(CustomTextWritable.class);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {

            try {
                // WritableUtils.decodeVIntSize(b1[s1]) is the number of bytes used by the VInt length prefix of first
                // readVInt(b1, s1) is the number of bytes of string data in first
                // Their sum is the total serialized size of the first field
                int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
                int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);

                // compare the first field only; passing the full lengths l1 and l2 here would be WRONG:
                // int cmp = TEXT_COMPARATOR.compare(b1, s1, l1, b2, s2, l2);
                int cmp = TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);

                if(cmp != 0){
                    return cmp;
                }

                // first field is same, then compare the second field.
                return TEXT_COMPARATOR.compare(b1, s1+firstL1, l1-firstL1, b2, s2+firstL2, l2-firstL2);
            } catch (IOException e) {
                throw new IllegalArgumentException(e);
            }

        }

    }

    static{
        WritableComparator.define(CustomTextWritable.class, new Comparator());
    }

    //A custom RawComparator for comparing the first field of CustomTextWritable
    public static class FirstComparator extends WritableComparator{

        private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

        protected FirstComparator() {
            super(CustomTextWritable.class);
        }

        // Compares the serialized bytes directly, without deserializing
        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {

            try {
                int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
                int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);

                return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
            } catch (IOException e) {
                throw new IllegalArgumentException(e);
            }
        }

        // Compares deserialized objects
        @Override
        public int compare(WritableComparable a, WritableComparable b) {

            if(a instanceof CustomTextWritable && b instanceof CustomTextWritable){
                return ( ((CustomTextWritable)a).first.compareTo( ((CustomTextWritable)b).first) );
            }
            return super.compare(a, b);
        }
    }
}

class MainTest extends WritableTestBase{

    public static void main(String[] args) throws IOException {

        CustomTextWritable writableA = new CustomTextWritable("hadoop","hive");
        CustomTextWritable writableB = new CustomTextWritable("hadoop","hive");

        @SuppressWarnings("unchecked")
        RawComparator<CustomTextWritable> comparator = WritableComparator.get(CustomTextWritable.class);
        //int compare = comparator.compare(writableA, writableB);

        byte[] bytesA = serialize(writableA);
        byte[] bytesB = serialize(writableB);
        int compare = comparator.compare(bytesA, 0, bytesA.length, bytesB, 0, bytesB.length);

        System.out.println(signum(compare)); // 0, because both fields are equal

    }

    public static int signum(int a){
        return (a<0)? -1 : ( (a==0)?0:1 );
    }
}
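The offset arithmetic in the raw comparator is easier to follow in isolation. The following JDK-only sketch compares two serialized two-field records without deserializing them. RawPairComparatorSketch is a hypothetical class, and it assumes each field is stored as one length byte followed by its UTF-8 bytes, which corresponds to Text's layout only for fields of up to 127 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class RawPairComparatorSketch {

    // Encodes two fields, each as one length byte followed by its UTF-8 bytes.
    static byte[] encode(String first, String second) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (String field : new String[] {first, second}) {
            byte[] utf8 = field.getBytes(StandardCharsets.UTF_8);
            if (utf8.length > 127) {
                throw new IllegalArgumentException("this sketch only handles fields up to 127 bytes");
            }
            buf.write(utf8.length);
            buf.write(utf8, 0, utf8.length);
        }
        return buf.toByteArray();
    }

    // Lexicographic comparison of two byte ranges, treating bytes as unsigned.
    static int compareBytes(byte[] a, int s1, int l1, byte[] b, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = a[s1 + i] & 0xff;
            int y = b[s2 + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return l1 - l2;
    }

    // Compares the first fields; only on a tie does it move on to the second fields.
    static int compare(byte[] a, byte[] b) {
        int firstLenA = a[0];
        int firstLenB = b[0];
        int cmp = compareBytes(a, 1, firstLenA, b, 1, firstLenB);
        if (cmp != 0) {
            return cmp;
        }
        // The second field starts right after the first field's prefix and data.
        int secondStartA = 1 + firstLenA;
        int secondStartB = 1 + firstLenB;
        return compareBytes(a, secondStartA + 1, a[secondStartA],
                            b, secondStartB + 1, b[secondStartB]);
    }

    public static void main(String[] args) {
        System.out.println(Integer.signum(compare(encode("hadoop", "hive"), encode("hadoop", "hive")))); // 0
        System.out.println(Integer.signum(compare(encode("hadoop", "hive"), encode("hadoop", "pig"))));  // -1
        System.out.println(Integer.signum(compare(encode("hbase", "a"), encode("hadoop", "z"))));        // 1
    }
}
```

The first comparison has to be bounded by the first field's own length. If it ran over the whole record, the second field's length byte and data would be compared as though they belonged to the first field, which is the mistake flagged in the comparator's comment.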