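Both MapReduce and HDFS in Hadoop need to communicate between processes and nodes, so the objects they exchange must be serialized. Rather than using Java's built-in serialization, Hadoop introduces its own serialization system.
The org.apache.hadoop.io package defines a large number of serializable types, all of which implement the Writable interface. Writable is the common interface for serializable objects in Hadoop.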
1 Data type interfaces
1.1 The Writable interface
Any class that implements the Writable interface can be serialized and deserialized. A typical example:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;

    public class MyWritable implements Writable {
        // Some data
        private int counter;
        private long timestamp;

        // Serialization method of the Writable interface
        public void write(DataOutput out) throws IOException {
            out.writeInt(counter);
            out.writeLong(timestamp);
        }

        // Deserialization method of the Writable interface
        public void readFields(DataInput in) throws IOException {
            counter = in.readInt();
            timestamp = in.readLong();
        }

        // Convenience factory: create an instance and populate it from the input
        public static MyWritable read(DataInput in) throws IOException {
            MyWritable w = new MyWritable();
            w.readFields(in);
            return w;
        }
    }

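To see the write/readFields pair in action without a Hadoop classpath, the following is a minimal sketch of a round trip through a byte array. SimpleWritable, Event and WritableRoundTrip are hypothetical names introduced only for this illustration; SimpleWritable merely mirrors the two methods of org.apache.hadoop.io.Writable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

class WritableRoundTrip {
    // Serialize any SimpleWritable into a byte array.
    static byte[] serialize(SimpleWritable w) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            w.write(out);
        }
        return bytes.toByteArray();
    }

    // Populate an existing instance from previously serialized bytes.
    static void deserialize(SimpleWritable w, byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            w.readFields(in);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = serialize(new Event(7, 1234L));
        Event copy = new Event();
        deserialize(copy, data);
        // An int (4 bytes) plus a long (8 bytes) gives 12 bytes in total.
        System.out.println(data.length + " bytes, counter=" + copy.counter
                + ", timestamp=" + copy.timestamp);
    }
}

// Hypothetical stand-in for org.apache.hadoop.io.Writable (same two methods).
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical value type with the same fields as MyWritable above.
class Event implements SimpleWritable {
    int counter;
    long timestamp;

    Event() {}

    Event(int counter, long timestamp) {
        this.counter = counter;
        this.timestamp = timestamp;
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(counter);
        out.writeLong(timestamp);
    }

    public void readFields(DataInput in) throws IOException {
        counter = in.readInt();
        timestamp = in.readLong();
    }
}
```

The fields must be read back in exactly the order they were written, because the byte stream carries no field names or type tags.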
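1.2 The Comparable interface
Comparable is a JDK interface in the java.lang package. An object whose class implements Comparable can be compared with other objects of the same type. The interface declares a single method: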

    public interface Comparable<T> {
        public int compareTo(T o);
    }

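The method compares this object with the argument o. It returns a negative number if this is less than o, 0 if the two are equal, and a positive number if this is greater than o.
Note: how this interface differs from java.util.Comparator
(1) If the author of a class did not anticipate comparison and did not implement Comparable, a Comparator can supply the ordering without changing the class itself.
(2) Comparable defines the default (natural) ordering, while Comparator defines a custom ordering. A Comparator is useful when the default ordering does not fit or when no default ordering exists.
(3) Comparators allow several orderings for the same type, for example ascending and descending.
(4) The sort methods in Arrays and Collections use the default ordering (Comparable) when no Comparator is given, and the supplied comparator otherwise:
Arrays.sort(Object[]) --> every element must implement Comparable, which determines the ordering between elements
Arrays.sort(Object[], Comparator) --> elements need not implement Comparable; the Comparator determines the ordering between elements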
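The difference between the two sort variants can be sketched as follows. SortDemo and Record are hypothetical names used only for this illustration: Record has a natural ordering by counter, and a separate Comparator sorts the same objects by timestamp in descending order without touching the class.

```java
import java.util.Arrays;
import java.util.Comparator;

class SortDemo {
    // Natural ordering (Comparable): ascending by counter.
    static final class Record implements Comparable<Record> {
        final int counter;
        final long timestamp;

        Record(int counter, long timestamp) {
            this.counter = counter;
            this.timestamp = timestamp;
        }

        @Override
        public int compareTo(Record other) {
            return Integer.compare(counter, other.counter);
        }

        @Override
        public String toString() {
            return counter + "@" + timestamp;
        }
    }

    // Uses Arrays.sort(Object[]): relies on Record.compareTo.
    static String sortedNaturally(Record[] records) {
        Record[] copy = records.clone();
        Arrays.sort(copy);
        return Arrays.toString(copy);
    }

    // Uses Arrays.sort(T[], Comparator): custom ordering, newest timestamp first.
    static String sortedByTimestampDesc(Record[] records) {
        Record[] copy = records.clone();
        Arrays.sort(copy, Comparator.comparingLong((Record r) -> r.timestamp).reversed());
        return Arrays.toString(copy);
    }

    public static void main(String[] args) {
        Record[] rs = { new Record(3, 30L), new Record(1, 10L), new Record(2, 20L) };
        System.out.println(sortedNaturally(rs));        // [1@10, 2@20, 3@30]
        System.out.println(sortedByTimestampDesc(rs));  // [3@30, 2@20, 1@10]
    }
}
```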
1.3 The WritableComparable interface
This interface extends both Writable and Comparable:

    // Source of the WritableComparable interface
    public interface WritableComparable<T> extends Writable, Comparable<T> {}

    // Demo class that implements WritableComparable
    public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
        // Some data
        private int counter;
        private long timestamp;

        public void write(DataOutput out) throws IOException {
            out.writeInt(counter);
            out.writeLong(timestamp);
        }

        public void readFields(DataInput in) throws IOException {
            counter = in.readInt();
            timestamp = in.readLong();
        }

        // Orders instances by counter
        public int compareTo(MyWritableComparable o) {
            int thisValue = this.counter;
            int thatValue = o.counter;
            return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
        }
    }

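1.4 The RawComparator interface
Comparing values is central to MapReduce, for example when keys are compared with each other during the sort phase. RawComparator exists to optimize this step: an implementation can compare records directly in their serialized byte form, without first deserializing them into objects.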

    package org.apache.hadoop.io;

    import java.util.Comparator;

    import org.apache.hadoop.io.serializer.DeserializerComparator;

    // Note: this extends java.util.Comparator, not java.lang.Comparable
    public interface RawComparator<T> extends Comparator<T> {

        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

    }

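Because the comparison happens at the byte level, the cost of deserializing the records (and of creating objects for them) is avoided.
The main implementation of this interface is the WritableComparator class. In most cases a class that implements WritableComparable declares a nested class that extends WritableComparator and compares the serialized bytes of that type.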
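The idea of a raw comparison can be sketched with the JDK alone. RawIntCompare is a hypothetical class written for this illustration; its compare method has the same parameter shape as RawComparator.compare, and readInt uses the same big-endian arithmetic as the readInt helper in the WritableComparator listing in section 1.5. Two serialized ints are ordered by decoding them in place from the byte arrays, with no wrapper objects created.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class RawIntCompare {
    // Serialize an int the way DataOutput.writeInt does: 4 bytes, big-endian.
    static byte[] toBytes(int value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(value);
        }
        return bytes.toByteArray();
    }

    // Decode a big-endian int directly from the array, starting at 'start'.
    static int readInt(byte[] b, int start) {
        return ((b[start] & 0xff) << 24) + ((b[start + 1] & 0xff) << 16)
             + ((b[start + 2] & 0xff) << 8) + (b[start + 3] & 0xff);
    }

    // Raw comparison: (array, start offset, length) for each record.
    // The lengths are unused here because an int always occupies 4 bytes.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    public static void main(String[] args) throws IOException {
        byte[] a = toBytes(188);
        byte[] b = toBytes(-68);
        System.out.println(compare(a, 0, a.length, b, 0, b.length) > 0);  // true
        System.out.println(compare(a, 0, a.length, a, 0, a.length) == 0); // true
    }
}
```

Note that a plain byte-by-byte comparison would not work for signed ints: the serialized form of -68 starts with 0xFF, which sorts after the 0x00 that starts 188. This is why a type-specific comparator decodes the value instead of comparing the bytes lexicographically.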
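1.5 The WritableComparator class
WritableComparator also acts as a registry: it keeps a map from each key class to the comparator registered for it. These comparators are usually nested classes of WritableComparable implementations (for example the Comparator class inside IntWritable), and each of them extends WritableComparator.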

    package org.apache.hadoop.io;

    import java.io.*;
    import java.util.*;
    import org.apache.hadoop.util.ReflectionUtils;

    public class WritableComparator implements RawComparator {

        /** Plays the role of the registry: maps each key class to its WritableComparator. */
        private static HashMap<Class, WritableComparator> comparators = new HashMap<Class, WritableComparator>();

        /**
         * HashMap is not thread-safe, so this method is synchronized.
         * It returns the WritableComparator registered for the given
         * Class<? extends WritableComparable>. If none is registered (null),
         * it falls back to the protected WritableComparator constructor with
         * createInstances = true, which calls newKey() and
         * new DataInputBuffer() to initialize key1, key2 and buffer.
         */
        public static synchronized WritableComparator get(Class<? extends WritableComparable> c) {
            WritableComparator comparator = comparators.get(c);
            if (comparator == null)
                comparator = new WritableComparator(c, true);
            return comparator;
        }

        /** Registers a WritableComparator in the registry; this method must be synchronized too. */
        public static synchronized void define(Class c, WritableComparator comparator) {
            comparators.put(c, comparator);
        }

        /** The type of the keys being compared. */
        private final Class<? extends WritableComparable> keyClass;

        /** The two keys used for comparison. */
        private final WritableComparable key1;
        private final WritableComparable key2;

        /** Reusable input buffer from which key1 and key2 are deserialized. */
        private final DataInputBuffer buffer;

        /** Construct for a WritableComparable implementation. */
        protected WritableComparator(Class<? extends WritableComparable> keyClass) {
            this(keyClass, false);
        }

        /**
         * Stores keyClass and, depending on createInstances, either
         * initializes key1, key2 and buffer or leaves them null.
         */
        protected WritableComparator(Class<? extends WritableComparable> keyClass, boolean createInstances) {
            this.keyClass = keyClass;
            if (createInstances) {
                key1 = newKey();
                key2 = newKey();
                buffer = new DataInputBuffer();
            } else {
                key1 = key2 = null;
                buffer = null;
            }
        }

        /** Construct a new {@link WritableComparable} instance. */
        public WritableComparable newKey() {
            return ReflectionUtils.newInstance(keyClass, null);
        }

        /** Returns the WritableComparable implementation class. */
        public Class<? extends WritableComparable> getKeyClass() { return keyClass; }

        /**
         * Default raw comparison. The buffer serves as a bridge: each byte range is
         * loaded into it, key1 and key2 (WritableComparable implementations) are
         * populated through readFields(DataInput in), and the two objects are then
         * compared. In other words, this method deserializes both byte ranges into
         * objects before comparing them.
         */
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            try {
                buffer.reset(b1, s1, l1);   // parse key1
                key1.readFields(buffer);

                buffer.reset(b2, s2, l2);   // parse key2
                key2.readFields(buffer);

            } catch (IOException e) {
                throw new RuntimeException(e);
            }

            return compare(key1, key2);     // compare them
        }

        @SuppressWarnings("unchecked")
        public int compare(WritableComparable a, WritableComparable b) {
            return a.compareTo(b);
        }

        public int compare(Object a, Object b) {
            return compare((WritableComparable)a, (WritableComparable)b);
        }

        /** Compares two byte ranges directly, byte by byte, treating each byte as unsigned. */
        public static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            int end1 = s1 + l1;
            int end2 = s2 + l2;
            for (int i = s1, j = s2; i < end1 && j < end2; i++, j++) {
                int a = (b1[i] & 0xff);
                int b = (b2[j] & 0xff);
                if (a != b) {
                    return a - b;
                }
            }
            return l1 - l2;
        }

        /** Compute hash for binary data. */
        public static int hashBytes(byte[] bytes, int offset, int length) {
            int hash = 1;
            for (int i = offset; i < offset + length; i++)
                hash = (31 * hash) + (int)bytes[i];
            return hash;
        }

        /** Compute hash for binary data. */
        public static int hashBytes(byte[] bytes, int length) {
            return hashBytes(bytes, 0, length);
        }

        /**
         * readUnsignedShort, readInt, readFloat, readLong, readDouble, readVLong and
         * readVInt below are low-level helpers for the comparators of concrete
         * WritableComparable types. Take IntWritable: its nested Comparator class
         * overrides the compare() method inherited from WritableComparator for its own
         * type. The compare() in WritableComparator is only a default implementation;
         * the real comparison is provided by the type-specific override, which calls
         * helpers such as readInt or readFloat to decode values straight from the bytes.
         */

        /** Parse an unsigned short from a byte array. */
        public static int readUnsignedShort(byte[] bytes, int start) {
            return (((bytes[start] & 0xff) << 8) + ((bytes[start+1] & 0xff)));
        }

        /** Parse an integer from a byte array. */
        public static int readInt(byte[] bytes, int start) {
            return ( ((bytes[start  ] & 0xff) << 24) + ((bytes[start+1] & 0xff) << 16) +
                     ((bytes[start+2] & 0xff) <<  8) + ((bytes[start+3] & 0xff)) );
        }

        /** Parse a float from a byte array. */
        public static float readFloat(byte[] bytes, int start) {
            return Float.intBitsToFloat(readInt(bytes, start));
        }

        /** Parse a long from a byte array. */
        public static long readLong(byte[] bytes, int start) {
            return ((long)(readInt(bytes, start)) << 32) + (readInt(bytes, start+4) & 0xFFFFFFFFL);
        }

        /** Parse a double from a byte array. */
        public static double readDouble(byte[] bytes, int start) {
            return Double.longBitsToDouble(readLong(bytes, start));
        }

        /**
         * Reads a zero-compressed encoded long from a byte array and returns it.
         * @param bytes byte array with decode long
         * @param start starting index
         * @throws java.io.IOException
         * @return deserialized long
         */
        public static long readVLong(byte[] bytes, int start) throws IOException {
            int len = bytes[start];
            if (len >= -112) {
                return len;
            }
            boolean isNegative = (len < -120);
            len = isNegative ? -(len + 120) : -(len + 112);
            if (start+1+len > bytes.length)
                throw new IOException(
                    "Not enough number of bytes for a zero-compressed integer");
            long i = 0;
            for (int idx = 0; idx < len; idx++) {
                i = i << 8;
                i = i | (bytes[start+1+idx] & 0xFF);
            }
            return (isNegative ? (i ^ -1L) : i);
        }

        /**
         * Reads a zero-compressed encoded integer from a byte array and returns it.
         * @param bytes byte array with the encoded integer
         * @param start start index
         * @throws java.io.IOException
         * @return deserialized integer
         */
        public static int readVInt(byte[] bytes, int start) throws IOException {
            return (int) readVLong(bytes, start);
        }
    }

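2 Basic data types
The org.apache.hadoop.io package that ships with Hadoop offers a wide range of Writable classes. Their class hierarchy is shown in the figure below:
Hadoop provides serializable counterparts of the Java primitive types, such as IntWritable, BooleanWritable, ByteWritable, DoubleWritable, FloatWritable and LongWritable, as well as NullWritable and Text. All of these classes implement the WritableComparable interface, so their values can be serialized, deserialized and compared.
2.1 IntWritable: integer type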

    package org.apache.hadoop.io;

    import java.io.*;

    public class IntWritable implements WritableComparable {

        /** The int value wrapped by this IntWritable */
        private int value;

        public IntWritable() {}

        public IntWritable(int value) { set(value); }

        /** Set the value of this IntWritable. */
        public void set(int value) { this.value = value; }

        /** Return the value of this IntWritable. */
        public int get() { return value; }

        /** Serialization and deserialization methods of the Writable parent interface */
        public void readFields(DataInput in) throws IOException {
            value = in.readInt();
        }
        public void write(DataOutput out) throws IOException {
            out.writeInt(value);
        }

        /** For comparing IntWritables, equals, hashCode and compareTo are overridden */
        /** Returns true if o is a IntWritable with the same value. */
        public boolean equals(Object o) {
            if (!(o instanceof IntWritable))
                return false;
            IntWritable other = (IntWritable)o;
            return this.value == other.value;
        }
        public int hashCode() {
            return value;
        }
        /** Compares two IntWritables. */
        public int compareTo(Object o) {
            int thisValue = this.value;
            int thatValue = ((IntWritable)o).value;
            return (thisValue<thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
        }

        public String toString() {
            return Integer.toString(value);
        }

        /** The optimized nested comparator for IntWritable. */
        public static class Comparator extends WritableComparator {
            public Comparator() {
                super(IntWritable.class);
            }

            /** Overrides compare() of the parent class WritableComparator */
            public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
                int thisValue = readInt(b1, s1);
                int thatValue = readInt(b2, s2);
                return (thisValue<thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
            }
        }

        // Register this type's comparator with WritableComparator
        static {
            WritableComparator.define(IntWritable.class, new Comparator());
        }
    }

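BooleanWritable, ByteWritable, FloatWritable, LongWritable and the other basic types are implemented much like IntWritable.
The test demos below are partly based on "Hadoop: The Definitive Guide":
(1) The test base class WritableTestBase, which provides basic serialization and deserialization helpers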

    package com.iresearch.hadoop.base;

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.StringUtils;

    public class WritableTestBase {

        /**
         * Serializes an object that implements org.apache.hadoop.io.Writable into a byte array.
         *
         * @param writable the object to serialize
         * @return byte[] the serialized bytes
         * @throws java.io.IOException
         */
        public static byte[] serialize(Writable writable) throws IOException {

            ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
            DataOutputStream dataOutputStream = new DataOutputStream(byteArrayOutputStream);

            writable.write(dataOutputStream);
            if (null != dataOutputStream) {
                dataOutputStream.close();
            }

            return byteArrayOutputStream.toByteArray();
        }

        /**
         * Populates an object that implements org.apache.hadoop.io.Writable from a byte array.
         *
         * @param writable the object to populate
         * @param bytes the serialized bytes to read from
         * @return byte[] the same bytes that were passed in
         * @throws java.io.IOException
         */
        public static byte[] deserialize(Writable writable, byte[] bytes) throws IOException {

            ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(bytes);
            DataInputStream dataInputStream = new DataInputStream(byteArrayInputStream);

            writable.readFields(dataInputStream);
            if (null != dataInputStream) {
                dataInputStream.close();
            }

            return bytes;
        }

        /**
         * Serializes an object that implements org.apache.hadoop.io.Writable and
         * returns the bytes as a hexadecimal string.
         *
         * @param writable the object to serialize
         * @return String the hexadecimal form of the serialized bytes
         * @throws java.io.IOException
         */
        public static String serializeToHexString(Writable writable) throws IOException{
            return StringUtils.byteToHexString( serialize(writable) );
        }

        /**
         * Copies the data of one Writable object into another Writable object.
         *
         * @param src the source object
         * @param dest the target object to write into
         * @return the hexadecimal form of the bytes that were written
         * @throws java.io.IOException
         */
        public static String writeTo(Writable src, Writable dest) throws IOException{
            byte[] bytes = deserialize(dest, serialize(src));
            return StringUtils.byteToHexString( bytes );
        }
    }

(2) IntWritable test demo

    package com.iresearch.hadoop.io;

    import static org.hamcrest.CoreMatchers.is;
    import static org.hamcrest.Matchers.greaterThan;
    import static org.hamcrest.Matchers.equalTo;
    import static org.junit.Assert.assertThat;

    import java.io.IOException;
    import java.util.Arrays;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.RawComparator;
    import org.apache.hadoop.io.WritableComparator;
    import org.junit.Test;

    import com.iresearch.hadoop.base.WritableTestBase;


    public class IntWritableTest extends WritableTestBase {

        @Test
        public void testByte(){
            int a = 5;
            System.out.println(Integer.toBinaryString(a & 0xFF));
            System.out.println(Integer.toBinaryString(-a & 0xFF));

            System.out.println(0xFF);
            // Conclusion: 0xFF is an int, 00000000 00000000 00000000 11111111, so its value is 255, not -1
            int b = -5;
            b ^= -1L;   // XOR with -1 flips every bit
            System.out.println(b);
        }


        /**
         * Checks how many bytes a serialized IntWritable occupies
         */
        @Test
        public void checkIntWritableLength() throws IOException{
            IntWritable writable = new IntWritable(188);    // 00000000 00000000 00000000 10111100
            // The last byte 1011 1100 is -68 as a signed byte:
            // subtract 1 -> 1011 1011; invert -> 0100 0100 = 2^6 + 2^2 = 68; hence -68
            byte[] data = serialize(writable);
            assertThat(data.length, is(4));                 // an IntWritable occupies four bytes
            System.out.println(Arrays.toString(data));      // [0, 0, 0, -68]
        }

        /**
         * Checks the hexadecimal string form of the serialized bytes
         */
        @Test
        public void checkBytesToString() throws IOException{
            IntWritable writable = new IntWritable(188);
            String bytesStr = serializeToHexString(writable);
            // 00000000 00000000 00000000 10111100 --> 00 00 00 bc
            assertThat(bytesStr, is("000000bc"));
            System.out.println(bytesStr);
        }

        /**
         * Checks deserialization
         */
        @Test
        public void checkDeserialize() throws IOException{
            IntWritable writable = new IntWritable(188);
            byte[] bytes = serialize(writable);

            IntWritable deseriaWritable = new IntWritable();
            deserialize(deseriaWritable, bytes);
            assertThat(deseriaWritable.get(), is(188));
            System.out.println(deseriaWritable.get());
        }

        /**
         * Checks the WritableComparator registered for IntWritable
         *
         * @throws IOException
         */
        @Test
        public void checkIntWritableComparator() throws IOException{

            @SuppressWarnings("unchecked")
            RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);

            IntWritable writableA = new IntWritable(188);
            IntWritable writableB = new IntWritable(-68);

            // Comparison between IntWritable objects
            int compare = comparator.compare(writableA, writableB);
            assertThat( compare, greaterThan(0) );
            System.out.println(compare);

            // Direct comparison of the serialized bytes of the IntWritable objects
            writableB.set(188);
            byte[] bytesA = serialize(writableA);
            byte[] bytesB = serialize(writableB);
            compare = comparator.compare(bytesA, 0, bytesA.length, bytesB, 0, bytesB.length);
            assertThat( compare, equalTo(0) );
            System.out.println(compare);
        }
    }

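The [0, 0, 0, -68] output in checkIntWritableLength follows from Java bytes being signed. A minimal sketch of that arithmetic, using a hypothetical SignedByteDemo class and only the JDK:

```java
class SignedByteDemo {
    // Narrowing an int to a byte keeps only the low 8 bits.
    static byte lowByte(int value) {
        return (byte) value;
    }

    // Masking with 0xFF recovers the unsigned value (0..255) of a byte.
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = lowByte(188);
        System.out.println(b);                                    // -68
        System.out.println(unsigned(b));                          // 188
        System.out.println(Integer.toBinaryString(unsigned(b)));  // 10111100
    }
}
```

The same `& 0xFF` mask appears throughout the Hadoop helpers (readInt, compareBytes, readVLong) for exactly this reason: without it, a byte such as 0xBC would be widened to a negative int.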
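2.2 VIntWritable: variable-length integer type
The source code of VIntWritable and VLongWritable is almost identical, and VIntWritable encodes and decodes its value with the same routine that VLongWritable uses. The main difference is that VIntWritable holds an int value while VLongWritable holds a long value, which follows from their value ranges. Unlike the other basic types, neither class has a nested Comparator class.
Their serialized sizes in bytes are listed in the table below: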
Java primitive type | Writable implementation | Serialized size (bytes) |
boolean | BooleanWritable | 1 |
byte | ByteWritable | 1 |
int | IntWritable | 4 |
int | VIntWritable | 1~5 |
float | FloatWritable | 4 |
long | LongWritable | 8 |
long | VLongWritable | 1~9 |
double | DoubleWritable | 8 |

    package org.apache.hadoop.io;

    import java.io.*;

    /**
     * A WritableComparable for longs in a variable-length format.
     * Such values take between one and nine bytes. Smaller values
     * take fewer bytes.
     */
    public class VLongWritable implements WritableComparable {
        private long value;

        public VLongWritable() {}

        public VLongWritable(long value) { set(value); }

        /** Set the value of this LongWritable. */
        public void set(long value) { this.value = value; }

        /** Return the value of this LongWritable. */
        public long get() { return value; }

        public void readFields(DataInput in) throws IOException {
            value = WritableUtils.readVLong(in);
        }

        public void write(DataOutput out) throws IOException {
            WritableUtils.writeVLong(out, value);
        }

        /** Returns true if o is a VLongWritable with the same value. */
        public boolean equals(Object o) {
            if (!(o instanceof VLongWritable))
                return false;
            VLongWritable other = (VLongWritable)o;
            return this.value == other.value;
        }

        public int hashCode() {
            return (int)value;
        }

        /** Compares two VLongWritables. */
        public int compareTo(Object o) {
            long thisValue = this.value;
            long thatValue = ((VLongWritable)o).value;
            return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
        }

        public String toString() {
            return Long.toString(value);
        }

    }

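The listing above is VLongWritable, whose write method calls WritableUtils.writeVLong(DataOutput stream, long i). VIntWritable.write calls WritableUtils.writeVInt(DataOutput stream, int i) instead. WritableUtils is a utility class for encoding and decoding,
and encoding a VIntWritable value ends up in the same routine, because writeVInt simply calls WritableUtils.writeVLong(stream, i):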

    public static void writeVInt(DataOutput stream, int i) throws IOException {
        writeVLong(stream, i);
    }

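A VIntWritable occupies 1 to 5 bytes and a VLongWritable 1 to 9 bytes. A value in [-112, 127] is stored in a single byte (8 bits) that holds the value itself. A value outside this range needs more bytes: the first byte encodes the sign and the number of bytes that follow, and the remaining bytes store the value (for a negative value, its bitwise complement).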
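First byte (Dec) | First byte (Bin) | First byte (Hex) | Sign of the value | Bytes that follow |
-128 | 1000 0000 | 80 | negative | 8 |
-127 | 1000 0001 | 81 | negative | 7 |
-126 | 1000 0010 | 82 | negative | 6 |
-125 | 1000 0011 | 83 | negative | 5 |
-124 | 1000 0100 | 84 | negative | 4 |
-123 | 1000 0101 | 85 | negative | 3 |
-122 | 1000 0110 | 86 | negative | 2 |
-121 | 1000 0111 | 87 | negative | 1 |
-120 | 1000 1000 | 88 | positive | 8 |
-119 | 1000 1001 | 89 | positive | 7 |
-118 | 1000 1010 | 8A | positive | 6 |
-117 | 1000 1011 | 8B | positive | 5 |
-116 | 1000 1100 | 8C | positive | 4 |
-115 | 1000 1101 | 8D | positive | 3 |
-114 | 1000 1110 | 8E | positive | 2 |
-113 | 1000 1111 | 8F | positive | 1 |
[-112, 127] | | | the byte is the value itself | 0 |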
Annotated source of WritableUtils.writeVLong(DataOutput stream, long i):

    public static void writeVLong(DataOutput stream, long i) throws IOException {

        if (i >= -112 && i <= 127) {    // values in this range are stored in a single byte
            stream.writeByte((byte)i);
            return;
        }

        int len = -112;                 // assume i is positive: the length marker counts down from -112

        if (i < 0) {                    // i is negative
            i ^= -1L;                   // XOR with ...1111 1111, i.e. flip every bit of i
            len = -120;                 // for negative i the length marker counts down from -120
        }

        long tmp = i;
        /*
         * At this point a positive i is unchanged, and a negative i has been
         * replaced by its bitwise complement, for example:
         *
         *   -158:   11111111 01100010
         *   -1:   ^ 11111111 11111111
         *   ----------------------------------------
         *           00000000 10011101   ==> 157
         */
        while (tmp != 0) {
            tmp = tmp >> 8;
            len--;  // each 8-bit (1 byte) right shift decrements len, meaning one more payload byte; tmp == 0 ends the count
        }

        /*
         * First write the marker byte that encodes the sign of i and its length
         *
         * positive: [-120,  8 payload bytes
         *            -119,  7 payload bytes
         *            -118,
         *            ...
         *            -113]  1 payload byte
         *
         * negative: [-128,  8 payload bytes
         *            -127,  7 payload bytes
         *            -126,
         *            ...
         *            -121]  1 payload byte
         */
        stream.writeByte((byte)len);

        len = (len < -120) ? -(len + 120) : -(len + 112);   // number of payload bytes

        for (int idx = len; idx != 0; idx--) {              // write i from the highest byte to the lowest
            int shiftbits = (idx - 1) * 8;
            long mask = 0xFFL << shiftbits;                 // the mask selects exactly the 8 bits at this position
            stream.writeByte((byte)((i & mask) >> shiftbits));
        }
    }

Next, how the variable-length data is read back. Annotated source of WritableUtils.readVLong(DataInput stream):

    /**
     * Reads a zero-compressed encoded long from input stream and returns it.
     * @param stream Binary input stream
     * @throws java.io.IOException
     * @return deserialized long from stream.
     */
    public static long readVLong(DataInput stream) throws IOException {

        byte firstByte = stream.readByte();     // read the first byte
        int len = decodeVIntSize(firstByte);    // total encoded length in bytes, including the sign/length marker byte
        if (len == 1) {
            return firstByte;
        }
        long i = 0;
        for (int idx = 0; idx < len-1; idx++) { // read the payload bytes from the DataInput
            byte b = stream.readByte();
            i = i << 8;
            // Java bytes are signed, so a payload byte can be negative (e.g. -9 is 1111 0111);
            // masking with 0xFF keeps the widened value from being sign-extended to 0xFFFFFF...
            i = i | (b & 0xFF);
        }
        return (isNegativeVInt(firstByte) ? (i ^ -1L) : i);
    }

    /**
     * Parse the first byte of a vint/vlong to determine the number of bytes
     * @param value the first byte of the vint/vlong
     * @return the total number of bytes (1 to 9)
     */
    public static int decodeVIntSize(byte value) {
        if (value >= -112) {
            return 1;
        } else if (value < -120) {
            return -119 - value;
        }
        return -111 - value;
    }

    /**
     * Given the first byte of a vint/vlong, determine the sign
     * @param value the first byte
     * @return is the value negative
     */
    public static boolean isNegativeVInt(byte value) {
        return value < -120 || (value >= -112 && value < 0);
    }

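To experiment with the encoding without Hadoop on the classpath, the write and read logic above can be condensed into a small stand-alone sketch. VLongCodec and its encode, decode and hex methods are hypothetical names for this illustration; the arithmetic follows the writeVLong and readVLong listings, but this is a re-implementation for study, not the Hadoop class itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class VLongCodec {
    // Encode a long in the zero-compressed format described above.
    static byte[] encode(long i) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        if (i >= -112 && i <= 127) {
            out.writeByte((int) i);              // the single byte is the value itself
        } else {
            int len = -112;                      // marker for positive values
            if (i < 0) {
                i ^= -1L;                        // store the bitwise complement
                len = -120;                      // marker for negative values
            }
            for (long tmp = i; tmp != 0; tmp >>= 8) {
                len--;                           // one step per payload byte
            }
            out.writeByte(len);                  // sign and length marker
            int payload = (len < -120) ? -(len + 120) : -(len + 112);
            for (int idx = payload; idx != 0; idx--) {
                out.writeByte((int) ((i >> ((idx - 1) * 8)) & 0xFF)); // high byte first
            }
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Decode a value produced by encode.
    static long decode(byte[] data) {
        byte first = data[0];
        if (first >= -112) {
            return first;
        }
        boolean negative = first < -120;
        int payload = negative ? -(first + 120) : -(first + 112);
        long i = 0;
        for (int idx = 1; idx <= payload; idx++) {
            i = (i << 8) | (data[idx] & 0xFF);   // mask to avoid sign extension
        }
        return negative ? ~i : i;
    }

    // Lower-case hex dump of a byte array.
    static String hex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(hex(encode(128)));     // 8f80
        System.out.println(hex(encode(-259)));    // 860102
        System.out.println(decode(encode(-259))); // -259
    }
}
```

For -259 the marker byte 0x86 says "negative, two payload bytes", and the payload 0x0102 is 258, the bitwise complement of -259.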
(1) VIntWritable test demo

    package com.iresearch.hadoop.io;

    import static org.hamcrest.CoreMatchers.is;
    import static org.junit.Assert.assertThat;

    import java.io.IOException;

    import org.apache.hadoop.io.VIntWritable;
    import org.apache.hadoop.io.VLongWritable;
    import org.junit.Test;

    import com.iresearch.hadoop.base.WritableTestBase;

    public class VIntWritableTest extends WritableTestBase {

        @Test
        public void testSerialize() throws IOException{

            VIntWritable vint = new VIntWritable(-259);
            byte[] bytes = serialize(vint);
            System.out.println( serializeToHexString(vint) );   // 860102: 1 marker byte + 2 payload bytes

            VIntWritable vintNew = new VIntWritable();
            deserialize(vintNew, bytes);

            System.out.println( vintNew.get() );    // -259

            System.out.println( serializeToHexString(new VIntWritable(1)) );    // 01, 1 byte
            System.out.println( serializeToHexString(new VIntWritable(-112)) ); // 90, 1 byte
            System.out.println( serializeToHexString(new VIntWritable(127)) );  // 7f, 1 byte
            System.out.println( serializeToHexString(new VIntWritable(128)) );  // 8f80, 2 bytes
            System.out.println( serializeToHexString(new VIntWritable(163)) );  // 8fa3, 2 bytes
            System.out.println( serializeToHexString(new VIntWritable(Integer.MAX_VALUE)) );    // 8c7fffffff, 5 bytes
            System.out.println( serializeToHexString(new VIntWritable(Integer.MIN_VALUE)) );    // 847fffffff, 5 bytes


            assertThat(serializeToHexString(new VLongWritable(1)), is("01"));       // 1 byte
            assertThat(serializeToHexString(new VLongWritable(127)), is("7f"));     // 1 byte
            assertThat(serializeToHexString(new VLongWritable(128)), is("8f80"));   // 2 bytes
            assertThat(serializeToHexString(new VLongWritable(163)), is("8fa3"));   // 2 bytes
            assertThat(serializeToHexString(new VLongWritable(Long.MAX_VALUE)), is("887fffffffffffffff"));  // 9 bytes
            assertThat(serializeToHexString(new VLongWritable(Long.MIN_VALUE)), is("807fffffffffffffff"));  // 9 bytes
        }

    }

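2.3 Text: text type
The Text class is the counterpart of Java's String type. It extends the BinaryComparable class and implements the WritableComparable<BinaryComparable> interface. Text stores its content in UTF-8 encoding and provides serialization, deserialization and comparison at the byte level.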
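Because Text works on UTF-8 bytes, its lengths and positions are byte counts, not char counts as in String. The sketch below illustrates that difference with the JDK only; Utf8Lengths and its methods are hypothetical helpers, and the claim about Text itself rests on the Javadoc of find quoted in the listing that follows (positions are measured in bytes).

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Byte offset of the first occurrence of 'what' in the UTF-8 form of 's',
    // or -1 if it does not occur. Computed by encoding the prefix before the match.
    static int byteOffset(String s, String what) {
        int charIndex = s.indexOf(what);
        if (charIndex < 0) {
            return -1;
        }
        return utf8Length(s.substring(0, charIndex));
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo \u4e16\u754c";              // "héllo 世界"
        System.out.println(s.length());                   // 8 chars
        System.out.println(utf8Length(s));                // 13 bytes
        System.out.println(s.indexOf("\u4e16"));          // char index 6
        System.out.println(byteOffset(s, "\u4e16"));      // byte offset 7
    }
}
```

The accented letter takes 2 bytes and each of the two CJK characters takes 3, which is why the byte length (13) and the byte offset (7) differ from the char-based values (8 and 6).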

    package org.apache.hadoop.io;

    import java.io.IOException;
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.MalformedInputException;
    import java.text.CharacterIterator;
    import java.text.StringCharacterIterator;

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;

    public class Text extends BinaryComparable implements WritableComparable<BinaryComparable> {
        private static final Log LOG= LogFactory.getLog(Text.class);

        private static ThreadLocal<CharsetEncoder> ENCODER_FACTORY =
            new ThreadLocal<CharsetEncoder>() {
                protected CharsetEncoder initialValue() {
                    return Charset.forName("UTF-8").newEncoder().
                        onMalformedInput(CodingErrorAction.REPORT).
                        onUnmappableCharacter(CodingErrorAction.REPORT);
                }
            };

        private static ThreadLocal<CharsetDecoder> DECODER_FACTORY =
            new ThreadLocal<CharsetDecoder>() {
                protected CharsetDecoder initialValue() {
                    return Charset.forName("UTF-8").newDecoder().
                        onMalformedInput(CodingErrorAction.REPORT).
                        onUnmappableCharacter(CodingErrorAction.REPORT);
                }
            };

        private static final byte [] EMPTY_BYTES = new byte[0];

        private byte[] bytes;
        private int length;

        public Text() {
            bytes = EMPTY_BYTES;
        }

        /** Construct from a string. */
        public Text(String string) { set(string); }

        /** Construct from another text. */
        public Text(Text utf8) { set(utf8); }

        /** Construct from a byte array. */
        public Text(byte[] utf8) { set(utf8); }

        /** Returns the raw bytes; however, only data up to {@link #getLength()} is valid. */
        public byte[] getBytes() { return bytes; }

        /** Returns the number of bytes in the byte array */
        public int getLength() { return length; }

        /**
         * Returns the Unicode code point, as an int, found at the given byte position.
         * This differs from String.charAt, which returns a char.
         */
        public int charAt(int position) {
            if (position > this.length) return -1; // too long
            if (position < 0) return -1; // duh.

            ByteBuffer bb = (ByteBuffer)ByteBuffer.wrap(bytes).position(position);
            return bytesToCodePoint(bb.slice());
        }

        public int find(String what) { return find(what, 0); }

        /**
         * find is the counterpart of String.indexOf: it returns the first position at
         * which a substring occurs in the byte array wrapped by this Text object
         */
        /**
         * Finds any occurence of <code>what</code> in the backing
         * buffer, starting as position <code>start</code>. The starting
         * position is measured in bytes and the return value is in
         * terms of byte position in the buffer. The backing buffer is
         * not converted to a string for this operation.
         * @return byte position of the first occurence of the search
         *         string in the UTF-8 buffer or -1 if not found
         */
        public int find(String what, int start) {
            try {
                ByteBuffer src = ByteBuffer.wrap(this.bytes,0,this.length);
                ByteBuffer tgt = encode(what);
                byte b = tgt.get();
                src.position(start);

                while (src.hasRemaining()) {
                    if (b == src.get()) { // matching first byte
                        src.mark(); // save position in loop
                        tgt.mark(); // save position in target
                        boolean found = true;
                        int pos = src.position()-1;
                        while (tgt.hasRemaining()) {
                            if (!src.hasRemaining()) { // src expired first
                                tgt.reset();
                                src.reset();
                                found = false;
                                break;
                            }
                            if (!(tgt.get() == src.get())) {
                                tgt.reset();
                                src.reset();
                                found = false;
                                break; // no match
                            }
                        }
                        if (found) return pos;
                    }
                }
                return -1; // not found
            } catch (CharacterCodingException e) {
                // can't get here
                e.printStackTrace();
                return -1;
            }
        }
        /** Set to contain the contents of a string. */
        public void set(String string) {
            try {
                ByteBuffer bb = encode(string, true);
                bytes = bb.array();
                length = bb.limit();
            }catch(CharacterCodingException e) {
                throw new RuntimeException("Should not have happened " + e.toString());
            }
        }

        /** Set to a utf8 byte array. */
        public void set(byte[] utf8) { set(utf8, 0, utf8.length); }

        /** copy a text. */
        public void set(Text other) { set(other.getBytes(), 0, other.getLength()); }

        /**
         * The overloaded set methods initialize the fields of the Text object.
         * setCapacity(int len, boolean keepData) sizes the bytes array of the Text
         * object and uses keepData to decide whether the existing data is preserved.
         *
         * System.arraycopy(Object src, int srcPos, Object dest, int destPos, int length)
         *   src      the array to copy from
         *   srcPos   the index in the source array at which copying starts
         *   dest     the array to copy into
         *   destPos  the index in the destination array at which storing starts
         *   length   the number of elements to copy
         */
        /**
         * Set the Text to range of bytes
         * @param utf8 the data to copy from
         * @param start the first position of the new string
         * @param len the number of bytes of the new string
         */
        public void set(byte[] utf8, int start, int len) {
            setCapacity(len, false);
            System.arraycopy(utf8, start, bytes, 0, len);
            this.length = len;
        }

        /** Appends a byte range to the end of the byte array wrapped by this Text */
        /**
         * Append a range of bytes to the end of the given text
         * @param utf8 the data to copy from
         * @param start the first position to append from utf8
         * @param len the number of bytes to append
         */
        public void append(byte[] utf8, int start, int len) {
            setCapacity(length + len, true);
            System.arraycopy(utf8, start, bytes, length, len);
            length += len;
        }

        /** Clears the value of this Text by setting its byte length to 0 */
        /** Clear the string to empty. */
        public void clear() { length = 0; }

        /*
         * Sets the capacity of this Text object to <em>at least</em>
         * <code>len</code> bytes. If the current buffer is longer,
         * then the capacity and existing content of the buffer are
         * unchanged. If <code>len</code> is larger
         * than the current capacity, the Text object's capacity is
         * increased to match.
         * @param len the number of bytes we need
         * @param keepData should the old data be kept
         */
        private void setCapacity(int len, boolean keepData) {
            if (bytes == null || bytes.length < len) {
                byte[] newBytes = new byte[len];
                if (bytes != null && keepData) {
                    System.arraycopy(bytes, 0, newBytes, 0, length);
                }
                bytes = newBytes;
            }
        }

        /**
         * Convert text back to string
         * @see java.lang.Object#toString()
         */
        public String toString() {
            try {
                return decode(bytes, 0, length);
            } catch (CharacterCodingException e) {
                throw new RuntimeException("Should not have happened " + e.toString());
            }
        }

        /** Serialization and deserialization of the Text object */
        /** serialize
         * write this object to out
         * length uses zero-compressed encoding
         * @see Writable#write(DataOutput)
         */
        public void write(DataOutput out) throws IOException {
            WritableUtils.writeVInt(out, length);
            out.write(bytes, 0, length);
        }

        /** deserialize */
        public void readFields(DataInput in) throws IOException {
            int newLength = WritableUtils.readVInt(in);
            setCapacity(newLength, false);
            in.readFully(bytes, 0, newLength);
            length = newLength;
        }

        /** Skips over one Text in the input. */
        public static void skip(DataInput in) throws IOException {
            int length = WritableUtils.readVInt(in);
            WritableUtils.skipFully(in, length);
        }

        /** Returns true iff <code>o</code> is a Text with the same contents. */
        public boolean equals(Object o) {
            if (o instanceof Text)
                return super.equals(o);
            return false;
        }

        public int hashCode() { return super.hashCode(); }

        /** A WritableComparator optimized for Text keys. */
        public static class Comparator extends WritableComparator {
            public Comparator() {
                super(Text.class);
            }

            public int compare(byte[] b1, int s1, int l1,
                               byte[] b2, int s2, int l2) {
                int n1 = WritableUtils.decodeVIntSize(b1[s1]);
                int n2 = WritableUtils.decodeVIntSize(b2[s2]);
                return compareBytes(b1, s1+n1, l1-n1, b2, s2+n2, l2-n2);
            }
        }

        static {
            // register this comparator
            WritableComparator.define(Text.class, new Comparator());
        }

        /// STATIC UTILITIES FROM HERE DOWN
        /**
         * Converts the provided byte array to a String using the
         * UTF-8 encoding. If the input is malformed,
         * replace by a default value.
         */
        public static String decode(byte[] utf8) throws CharacterCodingException {
            return decode(ByteBuffer.wrap(utf8), true);
        }

        public static String decode(byte[] utf8, int start, int length)
            throws CharacterCodingException {
            return decode(ByteBuffer.wrap(utf8, start, length), true);
        }

        /** The overloads that convert a UTF-8 encoded byte array into a String */
        /**
         * Converts the provided byte array to a String using the
         * UTF-8 encoding. If <code>replace</code> is true, then
         * malformed input is replaced with the
         * substitution character, which is U+FFFD. Otherwise the
         * method throws a MalformedInputException.
291 */ 292 public static String decode(byte[] utf8, int start, int length, boolean replace) 293 throws CharacterCodingException { 294 return decode(ByteBuffer.wrap(utf8, start, length), replace); 295 } 296 297 private static String decode(ByteBuffer utf8, boolean replace) 298 throws CharacterCodingException { 299 CharsetDecoder decoder = DECODER_FACTORY.get(); 300 if (replace) { 301 decoder.onMalformedInput( 302 java.nio.charset.CodingErrorAction.REPLACE); 303 decoder.onUnmappableCharacter(CodingErrorAction.REPLACE); 304 } 305 String str = decoder.decode(utf8).toString(); 306 // set decoder back to its default value: REPORT 307 if (replace) { 308 decoder.onMalformedInput(CodingErrorAction.REPORT); 309 decoder.onUnmappableCharacter(CodingErrorAction.REPORT); 310 } 311 return str; 312 } 313 314 /** 使用 UTF-8 的编码方式将 String 转化为字节缓冲(数组)的 不同重载实现 */ 315 /** 316 * Converts the provided String to bytes using the 317 * UTF-8 encoding. If the input is malformed, 318 * invalid chars are replaced by a default value. 319 * @return ByteBuffer: bytes stores at ByteBuffer.array() 320 * and length is ByteBuffer.limit() 321 */ 322 323 public static ByteBuffer encode(String string) 324 throws CharacterCodingException { 325 return encode(string, true); 326 } 327 328 /** 329 * Converts the provided String to bytes using the 330 * UTF-8 encoding. If <code>replace</code> is true, then 331 * malformed input is replaced with the 332 * substitution character, which is U+FFFD. Otherwise the 333 * method throws a MalformedInputException. 
334 * @return ByteBuffer: bytes stores at ByteBuffer.array() 335 * and length is ByteBuffer.limit() 336 */ 337 public static ByteBuffer encode(String string, boolean replace) 338 throws CharacterCodingException { 339 CharsetEncoder encoder = ENCODER_FACTORY.get(); 340 if (replace) { 341 encoder.onMalformedInput(CodingErrorAction.REPLACE); 342 encoder.onUnmappableCharacter(CodingErrorAction.REPLACE); 343 } 344 ByteBuffer bytes = encoder.encode(CharBuffer.wrap(string.toCharArray())); 345 if (replace) { 346 encoder.onMalformedInput(CodingErrorAction.REPORT); 347 encoder.onUnmappableCharacter(CodingErrorAction.REPORT); 348 } 349 return bytes; 350 } 351 352 /** Read a UTF8 encoded string from in 353 */ 354 public static String readString(DataInput in) throws IOException { 355 int length = WritableUtils.readVInt(in); 356 byte [] bytes = new byte[length]; 357 in.readFully(bytes, 0, length); 358 return decode(bytes); 359 } 360 361 /** Write a UTF8 encoded string to out 362 */ 363 public static int writeString(DataOutput out, String s) throws IOException { 364 ByteBuffer bytes = encode(s); 365 int length = bytes.limit(); 366 WritableUtils.writeVInt(out, length); 367 out.write(bytes.array(), 0, length); 368 return length; 369 } 370 371 // states for validateUTF8 372 373 private static final int LEAD_BYTE = 0; 374 375 private static final int TRAIL_BYTE_1 = 1; 376 377 private static final int TRAIL_BYTE = 2; 378 379 /** 380 * Check if a byte array contains valid utf-8 381 * @param utf8 byte array 382 * @throws MalformedInputException if the byte array contains invalid utf-8 383 */ 384 public static void validateUTF8(byte[] utf8) throws MalformedInputException { 385 validateUTF8(utf8, 0, utf8.length); 386 } 387 388 /** 389 * Check to see if a byte array is valid utf-8 390 * @param utf8 the array of bytes 391 * @param start the offset of the first byte in the array 392 * @param len the length of the byte sequence 393 * @throws MalformedInputException if the byte array contains 
invalid bytes 394 */ 395 public static void validateUTF8(byte[] utf8, int start, int len) 396 throws MalformedInputException { 397 int count = start; 398 int leadByte = 0; 399 int length = 0; 400 int state = LEAD_BYTE; 401 while (count < start+len) { 402 int aByte = ((int) utf8[count] & 0xFF); 403 404 switch (state) { 405 case LEAD_BYTE: 406 leadByte = aByte; 407 length = bytesFromUTF8[aByte]; 408 409 switch (length) { 410 case 0: // check for ASCII 411 if (leadByte > 0x7F) 412 throw new MalformedInputException(count); 413 break; 414 case 1: 415 if (leadByte < 0xC2 || leadByte > 0xDF) 416 throw new MalformedInputException(count); 417 state = TRAIL_BYTE_1; 418 break; 419 case 2: 420 if (leadByte < 0xE0 || leadByte > 0xEF) 421 throw new MalformedInputException(count); 422 state = TRAIL_BYTE_1; 423 break; 424 case 3: 425 if (leadByte < 0xF0 || leadByte > 0xF4) 426 throw new MalformedInputException(count); 427 state = TRAIL_BYTE_1; 428 break; 429 default: 430 // too long! Longest valid UTF-8 is 4 bytes (lead + three) 431 // or if < 0 we got a trail byte in the lead byte position 432 throw new MalformedInputException(count); 433 } // switch (length) 434 break; 435 436 case TRAIL_BYTE_1: 437 if (leadByte == 0xF0 && aByte < 0x90) 438 throw new MalformedInputException(count); 439 if (leadByte == 0xF4 && aByte > 0x8F) 440 throw new MalformedInputException(count); 441 if (leadByte == 0xE0 && aByte < 0xA0) 442 throw new MalformedInputException(count); 443 if (leadByte == 0xED && aByte > 0x9F) 444 throw new MalformedInputException(count); 445 // falls through to regular trail-byte test!! 446 case TRAIL_BYTE: 447 if (aByte < 0x80 || aByte > 0xBF) 448 throw new MalformedInputException(count); 449 if (--length == 0) { 450 state = LEAD_BYTE; 451 } else { 452 state = TRAIL_BYTE; 453 } 454 break; 455 } // switch (state) 456 count++; 457 } 458 } 459 460 /** 461 * Magic numbers for UTF-8. These are the number of bytes 462 * that <em>follow</em> a given lead byte. 
Trailing bytes 463 * have the value -1. The values 4 and 5 are presented in 464 * this table, even though valid UTF-8 cannot include the 465 * five and six byte sequences. 466 */ 467 static final int[] bytesFromUTF8 = 468 { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 469 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 470 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 471 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 472 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 473 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 474 0, 0, 0, 0, 0, 0, 0, 475 // trail bytes 476 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 477 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 478 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 479 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 480 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 481 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 482 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5 }; 483 484 /** 485 * Returns the next code point at the current position in 486 * the buffer. The buffer's position will be incremented. 487 * Any mark set on this buffer will be changed by this method! 488 */ 489 public static int bytesToCodePoint(ByteBuffer bytes) { 490 bytes.mark(); 491 byte b = bytes.get(); 492 bytes.reset(); 493 int extraBytesToRead = bytesFromUTF8[(b & 0xFF)]; 494 if (extraBytesToRead < 0) return -1; // trailing byte! 
495 int ch = 0; 496 497 switch (extraBytesToRead) { 498 case 5: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */ 499 case 4: ch += (bytes.get() & 0xFF); ch <<= 6; /* remember, illegal UTF-8 */ 500 case 3: ch += (bytes.get() & 0xFF); ch <<= 6; 501 case 2: ch += (bytes.get() & 0xFF); ch <<= 6; 502 case 1: ch += (bytes.get() & 0xFF); ch <<= 6; 503 case 0: ch += (bytes.get() & 0xFF); 504 } 505 ch -= offsetsFromUTF8[extraBytesToRead]; 506 507 return ch; 508 } 509 510 511 static final int offsetsFromUTF8[] = 512 { 0x00000000, 0x00003080, 513 0x000E2080, 0x03C82080, 0xFA082080, 0x82082080 }; 514 515 /** 516 * For the given string, returns the number of UTF-8 bytes 517 * required to encode the string. 518 * @param string text to encode 519 * @return number of UTF-8 bytes required to encode 520 */ 521 public static int utf8Length(String string) { 522 CharacterIterator iter = new StringCharacterIterator(string); 523 char ch = iter.first(); 524 int size = 0; 525 while (ch != CharacterIterator.DONE) { 526 if ((ch >= 0xD800) && (ch < 0xDC00)) { 527 // surrogate pair? 528 char trail = iter.next(); 529 if ((trail > 0xDBFF) && (trail < 0xE000)) { 530 // valid pair 531 size += 4; 532 } else { 533 // invalid pair 534 size += 3; 535 iter.previous(); // rewind one 536 } 537 } else if (ch < 0x80) { 538 size++; 539 } else if (ch < 0x800) { 540 size += 2; 541 } else { 542 // ch < 0x10000, that is, the largest char value 543 size += 3; 544 } 545 ch = iter.next(); 546 } 547 return size; 548 } 549 }
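The write and readFields methods above store the byte length as a zero-compressed (variable-length) integer. The following is a minimal JDK-only sketch of that idea. VIntSketch and encode are hypothetical names, and the byte layout is modeled on WritableUtils.writeVLong as far as I can tell, so treat it as an approximation and check the Hadoop source before relying on exact bytes.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical stand-in, not a Hadoop class.
final class VIntSketch {

    /** One byte for -112..127; otherwise a marker byte followed by big-endian value bytes. */
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (value >= -112 && value <= 127) {
            out.write((int) value);            // small values fit in a single byte
            return out.toByteArray();
        }
        int marker = -112;                     // marker range for positive values
        if (value < 0) {
            value = ~value;                    // store the one's complement
            marker = -120;                     // marker range for negative values
        }
        for (long tmp = value; tmp != 0; tmp >>= 8) {
            marker--;                          // one step per value byte needed
        }
        out.write(marker);
        int count = (marker < -120) ? -(marker + 120) : -(marker + 112);
        for (int idx = count; idx != 0; idx--) {
            int shift = (idx - 1) * 8;
            out.write((int) ((value >> shift) & 0xFF));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(encode(6)));    // [6]
        System.out.println(Arrays.toString(encode(128)));  // [-113, -128]
    }
}
```

Under this scheme a length below 128 costs a single byte, which is consistent with the 06 prefix in front of "hadoop" in the demos in this article.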
Unicode, UTF-8, and Java represent the same character in different ways:
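As a small illustration using only the JDK (no Hadoop classes), the sketch below prints, for each code point of a sample string, how many Java chars and how many UTF-8 bytes it occupies and at which byte offset it starts. CharFormsSketch and utf8Length are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

final class CharFormsSketch {

    /** Number of bytes the string occupies once encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // U+0041, U+00DF, U+6771 and U+10400 (the last one written as a surrogate pair)
        String s = "\u0041\u00DF\u6771\uD801\uDC00";
        int byteOffset = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            String one = new String(Character.toChars(cp));
            System.out.printf("U+%04X chars=%d utf8Bytes=%d byteOffset=%d%n",
                    cp, Character.charCount(cp), utf8Length(one), byteOffset);
            byteOffset += utf8Length(one);
            i += Character.charCount(cp);
        }
        // Prints byte offsets 0, 1, 3 and 6 for the four code points.
        System.out.println(s.length() + " chars, " + utf8Length(s) + " UTF-8 bytes"); // 5 chars, 10 UTF-8 bytes
    }
}
```

A String is indexed by char, while the offsets printed here are byte positions in the UTF-8 encoding, which is the kind of position Text works with.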
(1) Text test demo
    package com.iresearch.hadoop.io;

    import static org.hamcrest.CoreMatchers.is;
    import static org.junit.Assert.assertThat;

    import java.io.UnsupportedEncodingException;
    import java.nio.ByteBuffer;

    import org.apache.hadoop.io.Text;
    import org.junit.Test;

    import com.iresearch.hadoop.io.base.WritableTestBase;

    public class TextTest extends WritableTestBase {

        @Test
        public void test() {
            Text text = new Text("hadoop");

            // getLength(), getBytes().length
            assertThat(text.getLength(), is(6));
            assertThat(text.getBytes().length, is(6));

            // charAt()
            System.out.println(text.charAt(0));
            assertThat(text.charAt(0), is((int) 'h'));
            assertThat("Out of bounds..", text.charAt(100), is(-1));

            // find()
            assertThat("find a substring from Text", text.find("do"), is(2));
            assertThat("find first 'o' ", text.find("o"), is(3));
            assertThat("find 'p' from position 5 ", text.find("p", 5), is(5));
            assertThat("No match", text.find("hive"), is(-1));
        }

        // Part 1: how indexing differs between Text and java.lang.String
        @Test
        public void testIndex() throws UnsupportedEncodingException {
            Text text = new Text("\u0041\u00DF\u6771\uD801\uDC00");
            String str = new String("\u0041\u00DF\u6771\uD801\uDC00");

            // Step 1: String
            assertThat(str.length(), is(5));
            assertThat(str.getBytes("UTF-8").length, is(10));
            // indexOf(String str)
            assertThat(str.indexOf("\u0041"), is(0));
            assertThat(str.indexOf("\u00DF"), is(1));
            assertThat(str.indexOf("\u6771"), is(2));
            assertThat(str.indexOf("\uD801"), is(3));
            assertThat(str.indexOf("\uDC00"), is(4));
            // charAt(int index)
            assertThat(str.charAt(0), is('\u0041'));
            assertThat(str.charAt(1), is('\u00DF'));
            assertThat(str.charAt(2), is('\u6771'));
            assertThat(str.charAt(3), is('\uD801'));
            assertThat(str.charAt(4), is('\uDC00'));
            // codePointAt(int index)
            assertThat(str.codePointAt(0), is(0x0041));
            assertThat(str.codePointAt(1), is(0x00DF));
            assertThat(str.codePointAt(2), is(0x6771));
            assertThat(str.codePointAt(3), is(0x10400)); // \uD801\uDC00 together form one Unicode code point

            // Step 2: Text
            assertThat(text.getLength(), is(10));
            // find(String what) returns the byte offset of the string's UTF-8 encoding within the Text
            assertThat(text.find("\u0041"), is(0));
            assertThat(text.find("\u00DF"), is(1));
            assertThat(text.find("\u6771"), is(3));
            assertThat(text.find("\uD801"), is(-1)); // in UTF-8, \uD801\uDC00 is a single supplementary character
            assertThat(text.find("\uDC00"), is(-1));
            assertThat(text.find("\uD801\uDC00"), is(6));

            // charAt(int position) is similar to java.lang.String's codePointAt(int index)
            assertThat(text.charAt(0), is(0x0041));
            assertThat(text.charAt(1), is(0x00DF));
            assertThat(text.charAt(2), is(-1));
            assertThat(text.charAt(3), is(0x6771));
            assertThat(text.charAt(6), is(0x10400));

            System.out.println(text.getLength());       // 10
            System.out.println(text.getBytes().length); // 11
            /*
             * [position, limit, capacity] ===> [0, 10, 11]
             *
             * public Text(String string) {
             *     set(string);
             * }
             * public void set(String string) {
             *     try {
             *         ByteBuffer bb = encode(string, true);
             *         bytes = bb.array();
             *         length = bb.limit();
             *     } catch (CharacterCodingException e) {
             *         throw new RuntimeException("Should not have happened " + e.toString());
             *     }
             * }
             */
            System.out.println(str.length()); // 5
        }

        // Part 2: iterating over a Text object
        @Test
        public void testForEachText() {
            Text text = new Text("\u0041\u00DF\u6771\uD801\uDC00");

            ByteBuffer buffer = ByteBuffer.wrap(text.getBytes(), 0, text.getLength());
            int mark;
            while (buffer.hasRemaining() && (mark = Text.bytesToCodePoint(buffer)) != -1) {
                System.out.println(Integer.toHexString(mark));
            }
            // 41
            // df
            // 6771
            // 10400
        }

        // Part 3: mutability of Text
        @Test
        public void testMutability() {
            Text text = new Text("hadoop"); // ==> [104, 97, 100, 111, 111, 112]

            /* Text is mutable, like every Writable implementation except NullWritable */
            // text.set("hive");
            // System.out.println(text.getLength());       ==> 4
            // System.out.println(text.getBytes().length); ==> 4

            /* The array returned by getBytes() may be longer than getLength() */
            text.set(new Text("hive"));                 // ==> [104, 105, 118, 101]
            System.out.println(text.getLength());       // ==> 4
            System.out.println(text.getBytes().length); // ==> 6, capacity unchanged, bytes = [104, 105, 118, 101, 111, 112]

            /*
            public void set(String string) {
                try {
                    ByteBuffer bb = encode(string, true);
                    bytes = bb.array();
                    length = bb.limit();
                } catch (CharacterCodingException e) {
                    throw new RuntimeException("Should not have happened " + e.toString());
                }
            }

            public void set(Text other) {
                set(other.getBytes(), 0, other.getLength());
            }

            public void set(byte[] utf8, int start, int len) {
                setCapacity(len, false);
                // Copies the 4 bytes of utf8 [104, 105, 118, 101] over bytes [104, 97, 100, 111, 111, 112],
                // giving [104, 105, 118, 101, 111, 112]
                System.arraycopy(utf8, start, bytes, 0, len);
                this.length = len; // now this.length = len = 4
            }
            */
        }
    }
Text does not offer the rich string-manipulation API of java.lang.String, so in most cases a Text object is first converted to a String, usually by calling toString():
    assertThat(new Text("hadoop").toString(), is("hadoop"));

    // Source of toString()
    public String toString() {
        try {
            return decode(bytes, 0, length);
        } catch (CharacterCodingException e) {
            throw new RuntimeException("Should not have happened " + e.toString());
        }
    }
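2.4 BytesWritable type
BytesWritable wraps an array of binary data. Its serialized form is a 4-byte integer field giving the number of data bytes that follow, then the data bytes themselves. For example, a byte array of length 2 holding the values 3 and 5 serializes to a 4-byte integer (0x00000002) followed by the two bytes of the array (03 and 05). A test demo follows: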
    package com.iresearch.hadoop.io;

    import static org.hamcrest.CoreMatchers.is;
    import static org.junit.Assert.assertThat;

    import java.io.IOException;
    import java.util.Arrays;

    import org.apache.hadoop.io.BytesWritable;
    import org.junit.Test;

    import com.iresearch.hadoop.io.base.WritableTestBase;

    public class BytesWritableTest extends WritableTestBase {

        @Test
        public void test() throws IOException {
            // Observe the serialized form of a BytesWritable
            BytesWritable bytesWritable = new BytesWritable(new byte[]{3, 5});
            System.out.println(Arrays.toString(serialize(bytesWritable))); // [0, 0, 0, 2, 3, 5]
            assertThat(serializeToHexString(bytesWritable), is("000000020305"));

            // Difference between getLength() and getBytes().length
            bytesWritable.setCapacity(10);
            assertThat(bytesWritable.getLength(), is(2));                   // 2
            System.out.println(Arrays.toString(serialize(bytesWritable))); // [0, 0, 0, 2, 3, 5]
            assertThat(bytesWritable.getBytes().length, is(10));            // 10

            bytesWritable.setCapacity(1);
            assertThat(bytesWritable.getLength(), is(1));                   // 1
            System.out.println(Arrays.toString(serialize(bytesWritable))); // [0, 0, 0, 1, 3]
            assertThat(bytesWritable.getBytes().length, is(1));             // 1
            /*
            public BytesWritable(byte[] bytes) {
                this.bytes = bytes;
                this.size = bytes.length;
            }
            public void setCapacity(int new_cap) {
                if (new_cap != getCapacity()) {
                    byte[] new_data = new byte[new_cap];
                    if (new_cap < size) {
                        size = new_cap;
                    }
                    if (size != 0) {
                        System.arraycopy(bytes, 0, new_data, 0, size);
                    }
                    bytes = new_data;
                }
            }
            */
        }
    }
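The wire format described in this section can be reproduced with nothing but the JDK. The sketch below is a hypothetical stand-in, not BytesWritable itself: LengthPrefixedBytesSketch, serialize, and toHex are names made up for this example. It writes a 4-byte big-endian length followed by only the valid bytes, which is why a larger backing array does not change the output.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in, not a Hadoop class.
final class LengthPrefixedBytesSketch {

    /** 4-byte big-endian length, then the first `length` bytes of the buffer. */
    static byte[] serialize(byte[] buffer, int length) {
        return ByteBuffer.allocate(4 + length)
                .putInt(length)
                .put(buffer, 0, length)
                .array();
    }

    /** Lower-case hex dump, two digits per byte. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] buffer = new byte[10];   // capacity 10, only 2 valid bytes
        buffer[0] = 3;
        buffer[1] = 5;
        System.out.println(toHex(serialize(buffer, 2))); // 000000020305
        System.out.println(toHex(serialize(buffer, 1))); // 0000000103
    }
}
```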
2.5 NullWritable type
NullWritable is a singleton and is immutable. Its serialized length is 0: no bytes are written to the stream and none are read from it. It is generally used as a placeholder.
    package org.apache.hadoop.io;

    import java.io.*;

    /** Singleton Writable with no data. */
    public class NullWritable implements WritableComparable {

        private static final NullWritable THIS = new NullWritable();

        /** The constructor is private: NullWritable is a singleton and cannot be modified. */
        // no public ctor
        private NullWritable() {}

        /** Returns the single instance of this class. */
        public static NullWritable get() { return THIS; }

        public String toString() {
            return "(null)";
        }

        public int hashCode() { return 0; }

        public int compareTo(Object other) {
            if (!(other instanceof NullWritable)) {
                throw new ClassCastException("can't compare " + other.getClass().getName() + " to NullWritable");
            }
            return 0;
        }

        public boolean equals(Object other) { return other instanceof NullWritable; }
        public void readFields(DataInput in) throws IOException {}
        public void write(DataOutput out) throws IOException {}

        /** A Comparator "optimized" for NullWritable. */
        public static class Comparator extends WritableComparator {
            public Comparator() {
                super(NullWritable.class);
            }

            /** Compare the buffers in serialized form. */
            public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
                assert 0 == l1;
                assert 0 == l2;
                return 0;
            }
        }

        // register this comparator
        static {
            WritableComparator.define(NullWritable.class, new Comparator());
        }
    }
2.6 ObjectWritable and GenericWritable
ObjectWritable is a general-purpose wrapper for the following: Java primitives, String, enum, Writable, null, or arrays of any of these types. It is used in Hadoop RPC to marshal and unmarshal method arguments and return types. It is mainly useful when a field can hold values of more than one type. On the receiving side, deserialization relies on WritableFactory and WritableFactories, which create an instance of the named class.
(1) ObjectWritable test demo
    package com.iresearch.hadoop.io;

    import java.io.IOException;

    import org.apache.hadoop.io.ObjectWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.util.StringUtils;

    import com.iresearch.hadoop.io.base.WritableTestBase;

    public class ObjectWritableTest extends WritableTestBase {

        public static void main(String[] args) throws IOException {
            Text text = new Text("\u0041");
            ObjectWritable writable = new ObjectWritable(text);
            System.out.println(StringUtils.byteToHexString(serialize(writable)));
            // 00196f72672e6170616368652e6861646f6f702e696f2e5465787400196f72672e6170616368652e6861646f6f702e696f2e546578740141
            // (a) 0019 6f72672e6170616368652e6861646f6f702e696f2e54657874
            // (b) 0019 6f72672e6170616368652e6861646f6f702e696f2e54657874
            // (c) 0141
            /*
            For the serialization steps, see ObjectWritable's
            writeObject(DataOutput out, Object instance, Class declaredClass, Configuration conf) method.

            (1) Serialize the declared class of the ObjectWritable
                UTF8.writeString(out, declaredClass.getName()); ==>

                0019 6f72672e6170616368652e6861646f6f702e696f2e54657874
                (the leading short is the length of the class name string;
                 org.apache.hadoop.io.Text is 25 bytes = 0x0019)

            (2) Serialize the implementation class of the Writable instance
                if (Writable.class.isAssignableFrom(declaredClass)) { // Writable implementation
                    UTF8.writeString(out, instance.getClass().getName());
                    ((Writable) instance).write(out);
                } ==>

                0019 6f72672e6170616368652e6861646f6f702e696f2e54657874
                0141 (the variable-length serialized Text: 0x01 is the length, 0x41 the content)
            */

            ObjectWritable srcWritable = new ObjectWritable(Integer.TYPE, 188);
            ObjectWritable destWritable = new ObjectWritable();
            cloneInto(srcWritable, destWritable);
            System.out.println(serializeToHexString(srcWritable)); // 0003696e74000000bc
            System.out.println((Integer) destWritable.get());      // 188
        }
    }
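As the demo output shows, ObjectWritable is a general mechanism that writes the class name of the wrapped type on every serialization, which wastes a lot of space. GenericWritable addresses this case: when the number of wrapped types is small and known in advance, it keeps a static array of types and serializes an index into that array as the type reference, which improves performance. The supported types are specified in a subclass, as in the following example:
(2) GenericWritable test demo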
    package com.iresearch.hadoop.io;

    import java.io.IOException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.GenericWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.StringUtils;

    import com.iresearch.hadoop.io.base.WritableTestBase;

    public class GenericWritableTest extends WritableTestBase {

        public static void main(String[] args) throws IOException {
            Text text = new Text("hadoop");
            MyGenericWritable writable = new MyGenericWritable(text);

            System.out.println(StringUtils.byteToHexString(serialize(text)));     // 066861646f6f70
            // 00066861646f6f70 ==> 00, 066861646f6f70 (org.apache.hadoop.io.Text is the first entry in classes)
            System.out.println(StringUtils.byteToHexString(serialize(writable)));

            writable.set(new BytesWritable(new byte[]{3, 5}));
            // 01000000020305 ==> 01, 000000020305 (org.apache.hadoop.io.BytesWritable is the second entry in classes)
            System.out.println(serializeToHexString(writable));
            System.out.println(((BytesWritable) writable.get()).toString()); // 03 05
            System.out.println(writable.toString()); // GW[class=org.apache.hadoop.io.BytesWritable,value=03 05]

            /*
            // GenericWritable's set(Writable obj) method resets instance and type
            public void set(Writable obj) {
                instance = obj;
                Class<? extends Writable> instanceClazz = instance.getClass();
                Class<? extends Writable>[] clazzes = getTypes();
                for (int i = 0; i < clazzes.length; i++) {
                    Class<? extends Writable> clazz = clazzes[i];
                    if (clazz.equals(instanceClazz)) {
                        type = (byte) i;
                        return;
                    }
                }
                throw new RuntimeException("The type of instance is: " + instance.getClass() + ", which is NOT registered.");
            }

            // GenericWritable's serialization method
            public void write(DataOutput out) throws IOException {
                if (type == NOT_SET || instance == null)
                    throw new IOException("The GenericWritable has NOT been set correctly. type=" + type + ", instance=" + instance);
                out.writeByte(type); // type is the index of the wrapped object's class in MyGenericWritable.classes
                instance.write(out);
            }
            */
        }
    }

    @SuppressWarnings("unchecked")
    class MyGenericWritable extends GenericWritable {

        public MyGenericWritable(Writable writable) {
            set(writable);
        }

        public static Class<? extends Writable>[] classes = null;

        static {
            classes = (Class<? extends Writable>[]) new Class[]{
                Text.class, BytesWritable.class
            };
        }

        @Override
        protected Class<? extends Writable>[] getTypes() {
            return classes;
        }
    }
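To make the space difference concrete, here is a rough JDK-only sketch that tags the same payload in two ways: with a length-prefixed class name, and with a one-byte index into a fixed type list. TypeTagSketch, KNOWN_TYPES, tagWithName, and tagWithIndex are hypothetical names. The sketch writes the class name only once, whereas the ObjectWritable output shown earlier contains it twice, so the real gap is larger.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, not a Hadoop class.
final class TypeTagSketch {

    // Fixed, ordered list of supported types, known ahead of time.
    static final List<String> KNOWN_TYPES = Arrays.asList(
            "org.apache.hadoop.io.Text",
            "org.apache.hadoop.io.BytesWritable");

    /** Tag = 2-byte name length + class name bytes, then the payload. */
    static byte[] tagWithName(String className, byte[] payload) {
        byte[] name = className.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(name.length >>> 8);   // high byte of the length
        out.write(name.length);         // low byte of the length
        out.write(name, 0, name.length);
        out.write(payload, 0, payload.length);
        return out.toByteArray();
    }

    /** Tag = one byte holding the index of the class in KNOWN_TYPES, then the payload. */
    static byte[] tagWithIndex(String className, byte[] payload) {
        int index = KNOWN_TYPES.indexOf(className);
        if (index < 0) {
            throw new IllegalArgumentException(className + " is not registered");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(index);
        out.write(payload, 0, payload.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] payload = {6, 'h', 'a', 'd', 'o', 'o', 'p'}; // serialized "hadoop"
        System.out.println(tagWithName("org.apache.hadoop.io.Text", payload).length);  // 34
        System.out.println(tagWithIndex("org.apache.hadoop.io.Text", payload).length); // 8
    }
}
```

The index form only works when writer and reader agree on the list and its order, which is the trade-off for the smaller output.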
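3 Collection data types
The org.apache.hadoop.io package contains six Writable collection classes: ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable, and EnumSetWritable.
3.1 ArrayWritable and TwoDArrayWritable
ArrayWritable and TwoDArrayWritable are Writable implementations for arrays and two-dimensional arrays of Writable instances. All elements of an ArrayWritable or TwoDArrayWritable must be instances of the same class, which is specified in the constructor, as shown below:
    ArrayWritable arrayWritable = new ArrayWritable(Text.class);
Both ArrayWritable and TwoDArrayWritable have set(), get(), and toArray() methods. A test demo follows: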
    package com.iresearch.hadoop.io;

    import static org.hamcrest.CoreMatchers.is;
    import static org.junit.Assert.assertThat;

    import java.io.IOException;
    import java.util.Arrays;

    import org.apache.hadoop.io.ArrayWritable;
    import org.apache.hadoop.io.Text;
    import org.junit.Test;

    import com.iresearch.hadoop.io.base.WritableTestBase;

    public class ArrayWritableTest extends WritableTestBase {

        @Test
        public void testArrayWritable() throws IOException {
            ArrayWritable arrayWritable = new ArrayWritable(Text.class);
            arrayWritable.set(new Text[]{new Text("hadoop"), new Text("hive")});

            // The int array length is written first, then each element in serialized form:
            // 00000002 | 06 "hadoop" | 04 "hive"
            // [0, 0, 0, 2, 6, 104, 97, 100, 111, 111, 112, 4, 104, 105, 118, 101]
            System.out.println(Arrays.toString(serialize(arrayWritable)));

            /*
            public void readFields(DataInput in) throws IOException {
                values = new Writable[in.readInt()];          // construct values
                for (int i = 0; i < values.length; i++) {
                    Writable value = WritableFactories.newInstance(valueClass);
                    value.readFields(in);                     // read a value
                    values[i] = value;                        // store it in values
                }
            }

            public void write(DataOutput out) throws IOException {
                out.writeInt(values.length);                  // write values
                for (int i = 0; i < values.length; i++) {
                    values[i].write(out);
                }
            }
            */

            MyArrayWritable myWritable = new MyArrayWritable();
            cloneInto(arrayWritable, myWritable);
            assertThat(myWritable.get().length, is(2));
            assertThat((Text) myWritable.get()[0], is(new Text("hadoop")));

            // Test ArrayWritable's toArray() method
            Text[] textArray = (Text[]) myWritable.toArray();
            System.out.println(textArray[1].toString()); // hive
        }
    }

    class MyArrayWritable extends ArrayWritable {

        public MyArrayWritable() {
            super(Text.class);
        }
    }
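The count-then-elements layout used by write and readFields can be sketched with the JDK alone. StringArrayFormatSketch below is a hypothetical stand-in that stores strings instead of Writable objects, and it simplifies the element length to a single byte, so it only accepts elements of at most 127 UTF-8 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in, not a Hadoop class.
final class StringArrayFormatSketch {

    /** Writes a 4-byte element count, then each string as a 1-byte length plus UTF-8 bytes. */
    static byte[] serialize(String... values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] count = ByteBuffer.allocate(4).putInt(values.length).array();
        out.write(count, 0, count.length);
        for (String value : values) {
            byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
            if (utf8.length > 127) {
                throw new IllegalArgumentException("sketch only handles one-byte lengths");
            }
            out.write(utf8.length);
            out.write(utf8, 0, utf8.length);
        }
        return out.toByteArray();
    }

    /** Reads the count first, then rebuilds every element in order. */
    static String[] deserialize(byte[] data) {
        ByteBuffer in = ByteBuffer.wrap(data);
        String[] values = new String[in.getInt()];
        for (int i = 0; i < values.length; i++) {
            byte[] utf8 = new byte[in.get()];
            in.get(utf8);
            values[i] = new String(utf8, StandardCharsets.UTF_8);
        }
        return values;
    }

    public static void main(String[] args) {
        byte[] bytes = serialize("hadoop", "hive");
        // [0, 0, 0, 2, 6, 104, 97, 100, 111, 111, 112, 4, 104, 105, 118, 101]
        System.out.println(Arrays.toString(bytes));
        System.out.println(Arrays.toString(deserialize(bytes))); // [hadoop, hive]
    }
}
```

Because no per-element type is written, the reader has to know the element class up front, which matches the requirement that the class be passed to the constructor.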
3.2 MapWritable
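MapWritable and SortedMapWritable implement java.util.Map<Writable, Writable> and java.util.SortedMap<WritableComparable, Writable> respectively. The type of each key and value is part of the serialized form of that entry. The type is stored as a single byte that acts as an index into an array of types. The array is pre-populated with the standard types in the org.apache.hadoop.io package; custom Writable types are also supported, by writing a header that encodes the type array for the nonstandard types. As implemented, MapWritable and SortedMapWritable use positive byte values (1 to 127) for custom types, so at most 127 distinct nonstandard Writable classes can be used in any MapWritable or SortedMapWritable instance. A test demo follows: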
    package com.iresearch.hadoop.io;

    import static org.hamcrest.CoreMatchers.is;
    import static org.junit.Assert.assertThat;

    import java.io.IOException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.VIntWritable;
    import org.junit.Test;

    import com.iresearch.hadoop.io.base.WritableTestBase;

    public class MapWritableTest extends WritableTestBase {

        // Test the keys of a MapWritable
        @Test
        public void testKeyInMapWritable() throws IOException {
            MapWritable mapWritable = new MapWritable();
            mapWritable.put(new IntWritable(1), new Text("hadoop"));
            mapWritable.put(new VIntWritable(2), new BytesWritable(new byte[]{3, 5}));

            MapWritable destWritable = new MapWritable();
            cloneInto(mapWritable, destWritable);
            assertThat((Text) destWritable.get(new IntWritable(1)), is(new Text("hadoop")));

            /*
            assertThat(((BytesWritable) destWritable.get(new IntWritable(2))).getLength(), is(2));
            ==> fails with java.lang.NullPointerException: the entry was stored under a VIntWritable key,
                so looking it up with an IntWritable of the same numeric value finds nothing.
                The class of the key matters, not only the value it holds.

            Source of MapWritable's put() method:

            public Writable put(Writable key, Writable value) {
                addToMap(key.getClass());
                addToMap(value.getClass());
                return instance.put(key, value);
            }

            addToMap(Class clazz) is a method of the parent class AbstractMapWritable:

            protected synchronized void addToMap(Class clazz) {
                if (classToIdMap.containsKey(clazz)) { return; }
                if (newClasses + 1 > Byte.MAX_VALUE) {
                    throw new IndexOutOfBoundsException("adding an additional class would exceed the maximum number allowed");
                }
                byte id = ++newClasses;
                addToMap(clazz, id);
            }

            Map<Class, Byte> classToIdMap = new ConcurrentHashMap<Class, Byte>();
            Map<Byte, Class> idToClassMap = new ConcurrentHashMap<Byte, Class>();
            // MapWritable and SortedMapWritable, which extend AbstractMapWritable,
            // can use at most 127 distinct nonstandard Writable classes
            private volatile byte newClasses = 0;

            private synchronized void addToMap(Class clazz, byte id) {
                if (classToIdMap.containsKey(clazz)) {
                    byte b = classToIdMap.get(clazz);
                    if (b != id) {
                        throw new IllegalArgumentException("Class " + clazz.getName() + " already registered but maps to " + b + " and not " + id);
                    }
                }
                if (idToClassMap.containsKey(id)) {
                    Class c = idToClassMap.get(id);
                    if (!c.equals(clazz)) {
                        throw new IllegalArgumentException("Id " + id + " exists but maps to " + c.getName() + " and not " + clazz.getName());
                    }
                }
                classToIdMap.put(clazz, id);
                idToClassMap.put(id, clazz);
            }
            */
            assertThat(((BytesWritable) destWritable.get(new VIntWritable(2))).getLength(), is(2));
        }

        // Test MapWritable serialization; for the serialization process see Figure 3.2
        @Test
        public void testSerialize() throws IOException {
            MapWritable mapWritable = new MapWritable();
            mapWritable.put(new IntWritable(1), new Text("hadoop"));
            mapWritable.put(new VIntWritable(2), new BytesWritable(new byte[]{3, 5}));

            // 000000000285000000018c066861646f6f708e0283000000020305
            // 00, 00000002, 85, 00000001, 8c, 06 6861646f6f70, 8e, 02, 83, 00000002 0305
            System.out.println(serializeToHexString(mapWritable));
        }
    }
Figure 3.2: source code of the MapWritable serialization process.
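As a rough illustration of the byte layout printed by testSerialize, the sketch below rebuilds the same hex string with plain JDK streams. It is not Hadoop's implementation: the class name `MapWritableLayoutSketch` is hypothetical, the class ids are simply copied from the hex dump in the test comment, and it assumes every variable-length integer here is small enough to fit in one byte.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class MapWritableLayoutSketch {

    // Class ids copied from the hex dump in the testSerialize comment (assumption).
    static final int ID_BYTES = 0x83, ID_INT = 0x85, ID_TEXT = 0x8c, ID_VINT = 0x8e;

    static byte[] encode() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeByte(0);            // count of newly registered (non-standard) classes
            out.writeInt(2);             // number of map entries

            // Entry 1: IntWritable(1) -> Text("hadoop")
            out.writeByte(ID_INT);
            out.writeInt(1);             // fixed 4-byte int
            out.writeByte(ID_TEXT);
            byte[] text = "hadoop".getBytes(StandardCharsets.UTF_8);
            out.writeByte(text.length);  // length prefix, one byte for small values
            out.write(text);

            // Entry 2: VIntWritable(2) -> BytesWritable({3, 5})
            out.writeByte(ID_VINT);
            out.writeByte(2);            // the value 2 as a one-byte variable-length int
            out.writeByte(ID_BYTES);
            out.writeInt(2);             // byte-array length as a fixed 4-byte int
            out.write(new byte[] {3, 5});
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints 000000000285000000018c066861646f6f708e0283000000020305
        System.out.println(toHex(encode()));
    }
}
```

Reading the output left to right gives the same fields as the comment in the test: class count, entry count, then for each entry a class id followed by the serialized key and a class id followed by the serialized value.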
4 Implementing a custom Writable type
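Hadoop's built-in Writable implementations cover most needs, but sometimes a requirement calls for a custom one. A custom Writable gives us full control over the binary representation and the sort order. Because Writable sits at the core of the MapReduce data path, tuning the binary representation can have a significant effect on performance. Here is a custom Writable type: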
package com.iresearch.hadoop.io;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Arrays;

import org.apache.commons.lang.ArrayUtils;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.io.WritableUtils;

import com.iresearch.hadoop.io.base.WritableTestBase;

public class CustomTextWritable implements WritableComparable<CustomTextWritable> {

    private Text first;
    private Text second;

    public CustomTextWritable() {
        set(new Text(), new Text());
    }

    public CustomTextWritable(Text first, Text second) {
        set(first, second);
    }

    public CustomTextWritable(String first, String second) {
        set(new Text(first), new Text(second));
    }

    public void set(Text first, Text second) {
        this.first = first;
        this.second = second;
    }

    public Text getFirst() {
        return first;
    }

    public Text getSecond() {
        return second;
    }

    public byte[] getBytes() {
        // Text.getBytes() returns the backing array, of which only the first
        // getLength() bytes are valid, so copy just the valid part of each field.
        return ArrayUtils.addAll(
                Arrays.copyOf(first.getBytes(), first.getLength()),
                Arrays.copyOf(second.getBytes(), second.getLength()));
    }

    public int getLength() {
        return first.getLength() + second.getLength();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
    }

    @Override
    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof CustomTextWritable) {
            CustomTextWritable other = (CustomTextWritable) obj;
            return first.equals(other.first) && second.equals(other.second);
        }
        return false;
    }

    @Override
    public String toString() {
        return first.toString() + "\t" + second.toString();
    }

    @Override
    public int compareTo(CustomTextWritable other) {
        int cmp = first.compareTo(other.first);
        if (cmp != 0) {
            return cmp;
        }
        return second.compareTo(other.second);
    }

    // The default comparator of CustomTextWritable
    public static class Comparator extends WritableComparator {

        private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

        protected Comparator() {
            super(CustomTextWritable.class);
        }

        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {

            try {
                // WritableUtils.decodeVIntSize(b1[s1]): number of bytes used by the length prefix of first
                // readVInt(b1, s1): number of bytes used by the data of first
                int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
                int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);

                // Compare the first field. Passing the whole-record lengths would be wrong:
                // int cmp = TEXT_COMPARATOR.compare(b1, s1, l1, b2, s2, l2);
                int cmp = TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);

                if (cmp != 0) {
                    return cmp;
                }

                // The first fields are equal, so compare the second field.
                return TEXT_COMPARATOR.compare(b1, s1 + firstL1, l1 - firstL1, b2, s2 + firstL2, l2 - firstL2);
            } catch (IOException e) {
                throw new IllegalArgumentException(e);
            }

        }

    }

    static {
        WritableComparator.define(CustomTextWritable.class, new Comparator());
    }

    // A custom RawComparator that compares only the first field of CustomTextWritable
    public static class FirstComparator extends WritableComparator {

        private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

        protected FirstComparator() {
            super(CustomTextWritable.class);
        }

        // Compares the serialized bytes directly
        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {

            try {
                int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
                int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);

                return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
            } catch (Exception e) {
                throw new IllegalArgumentException(e);
            }
        }

        // Compares deserialized objects
        @Override
        public int compare(WritableComparable a, WritableComparable b) {

            if (a instanceof CustomTextWritable && b instanceof CustomTextWritable) {
                return ((CustomTextWritable) a).first.compareTo(((CustomTextWritable) b).first);
            }
            return super.compare(a, b);
        }
    }
}

class MainTest extends WritableTestBase {

    public static void main(String[] args) throws IOException {

        CustomTextWritable writableA = new CustomTextWritable("hadoop", "hive");
        CustomTextWritable writableB = new CustomTextWritable("hadoop", "hive");

        @SuppressWarnings("unchecked")
        RawComparator<CustomTextWritable> comparator = WritableComparator.get(CustomTextWritable.class);
        //int compare = comparator.compare(writableA, writableB);

        byte[] bytesA = serialize(writableA);
        byte[] bytesB = serialize(writableB);
        int compare = comparator.compare(bytesA, 0, bytesA.length, bytesB, 0, bytesB.length);

        System.out.println(signum(compare));

    }

    public static int signum(int a) {
        return (a < 0) ? -1 : ((a == 0) ? 0 : 1);
    }
}
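The offset arithmetic in the raw comparator is easier to follow without the Hadoop classes around it. Below is a minimal sketch of the same idea using only the JDK: a pair of strings is serialized as two length-prefixed fields and compared byte by byte without being deserialized. The class `RawPairCompareSketch` and its methods are hypothetical, and the sketch assumes every field is at most 127 bytes long so that the length prefix is exactly one byte (Hadoop's real prefix is a variable-length integer).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class RawPairCompareSketch {

    // Serializes a pair as [len][bytes][len][bytes] with a one-byte length prefix.
    static byte[] serialize(String first, String second) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String s : new String[] {first, second}) {
            byte[] b = s.getBytes(StandardCharsets.UTF_8);
            if (b.length > 127) {
                throw new IllegalArgumentException("field too long for this sketch");
            }
            out.write(b.length);
            out.write(b, 0, b.length);
        }
        return out.toByteArray();
    }

    // Lexicographic comparison of two byte ranges, treating bytes as unsigned.
    static int compareBytes(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int x = b1[s1 + i] & 0xff;
            int y = b2[s2 + i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return l1 - l2;
    }

    // Compares two serialized pairs that start at offsets s1 and s2.
    static int compare(byte[] b1, int s1, byte[] b2, int s2) {
        // Total size of the first field: one prefix byte plus the payload length.
        int firstL1 = 1 + b1[s1];
        int firstL2 = 1 + b2[s2];
        int cmp = compareBytes(b1, s1 + 1, firstL1 - 1, b2, s2 + 1, firstL2 - 1);
        if (cmp != 0) {
            return cmp;
        }
        // First fields are equal: skip past them and compare the second fields.
        int p1 = s1 + firstL1;
        int p2 = s2 + firstL2;
        return compareBytes(b1, p1 + 1, b1[p1], b2, p2 + 1, b2[p2]);
    }

    public static void main(String[] args) {
        byte[] a = serialize("hadoop", "hive");
        byte[] b = serialize("hadoop", "pig");
        System.out.println(Integer.signum(compare(a, 0, a, 0))); // 0
        System.out.println(Integer.signum(compare(a, 0, b, 0))); // -1
    }
}
```

The point of computing `firstL1` and `firstL2` is that each field is compared only within its own boundaries; comparing the whole records as one byte string would mix the second field's length prefix into the comparison of the first field.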
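References:
[1] Hadoop: The Definitive Guide, 2nd edition (Chinese translation)
[2] Zhang Xin, Hadoop源代码分析 (Hadoop Source Code Analysis), revised edition, 2014.07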