Serialization
Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Deserialization is the reverse process of turning a byte stream back into structured objects. Serialization appears in two main areas of distributed data processing: interprocess communication and permanent storage.
In Hadoop, interprocess communication between nodes is implemented with RPC (remote procedure call). The RPC protocol serializes a message into a binary stream before sending it to the remote node, which deserializes the binary stream back into the original message. In general, an RPC serialization format should be:
- Compact: a compact format makes full use of network bandwidth, the scarcest resource in a data center.
- Fast: interprocess communication forms the backbone of a distributed system, so serialization and deserialization overhead must be kept to a minimum.
- Extensible: protocols change over time to meet new requirements, so it must be possible to evolve the protocol in a controlled manner for clients and servers.
- Interoperable: in some systems, clients and servers are written in different languages, so the format needs to be designed to support data exchange between languages.
These four properties matter even more for persistent storage than for RPC. In terms of lifetime, an RPC message typically lives for less than a second, whereas persistently stored data may not be read again until years after it was written. A persistent storage format should therefore be compact (to use storage space efficiently), fast (so reads and writes carry little overhead), extensible (so data in old formats can be read transparently), and interoperable (so the stored data can be read and written from different languages). Hadoop's built-in serialization format is Writable: it is compact and fast, but not very interoperable (it is hard to extend and use from languages other than Java).
1. The Writable interface
The Writable interface defines two methods: one for writing an object's state to a binary output stream, and one for reading its state back from a binary input stream:
/**
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.hadoop.io;
import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
/**
* A serializable object which implements a simple, efficient, serialization
 * protocol, based on {@link DataInput} and {@link DataOutput}.
*
* <p>Any <code>key</code> or <code>value</code> type in the Hadoop Map-Reduce
* framework implements this interface.</p>
*
* <p>Implementations typically implement a static <code>read(DataInput)</code>
 * method which constructs a new instance, calls {@link #readFields(DataInput)}
* and returns the instance.</p>
*
* <p>Example:</p>
* <p><blockquote><pre>
* public class MyWritable implements Writable {
* // Some data
* private int counter;
* private long timestamp;
*
* public void write(DataOutput out) throws IOException {
* out.writeInt(counter);
* out.writeLong(timestamp);
* }
*
* public void readFields(DataInput in) throws IOException {
* counter = in.readInt();
* timestamp = in.readLong();
* }
*
* public static MyWritable read(DataInput in) throws IOException {
* MyWritable w = new MyWritable();
* w.readFields(in);
* return w;
* }
* }
* </pre></blockquote></p>
*/
@InterfaceAudience.Public
@InterfaceStability.Stable
public interface Writable {
/**
* Serialize the fields of this object to <code>out</code>.
*
* @param out <code>DataOuput</code> to serialize this object into.
* @throws IOException
*/
void write(DataOutput out) throws IOException;
/**
* Deserialize the fields of this object from <code>in</code>.
*
* <p>For efficiency, implementations should attempt to re-use storage in the
* existing object where possible.</p>
*
* @param in <code>DataInput</code> to deseriablize this object from.
* @throws IOException
*/
void readFields(DataInput in) throws IOException;
}
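To make the contract concrete, here is a minimal sketch of the write()/readFields() pattern using only java.io, so it runs without Hadoop on the classpath; MyRecord and the serialize/deserialize helpers are illustrative names of our own, not Hadoop API:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class WritableSketch {

    // Mirrors the MyWritable example from the Javadoc above: fields are
    // written and read back in the same fixed order.
    static class MyRecord {
        int counter;
        long timestamp;

        MyRecord() {}
        MyRecord(int counter, long timestamp) {
            this.counter = counter;
            this.timestamp = timestamp;
        }

        void write(DataOutput out) throws IOException {
            out.writeInt(counter);     // 4 bytes, big-endian
            out.writeLong(timestamp);  // 8 bytes, big-endian
        }

        void readFields(DataInput in) throws IOException {
            counter = in.readInt();
            timestamp = in.readLong();
        }
    }

    // Serialize a record to a byte array, as Hadoop does when shipping
    // keys and values between tasks.
    static byte[] serialize(MyRecord r) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        r.write(new DataOutputStream(bytes));
        return bytes.toByteArray();
    }

    static MyRecord deserialize(byte[] data) throws IOException {
        MyRecord r = new MyRecord();
        r.readFields(new DataInputStream(new ByteArrayInputStream(data)));
        return r;
    }
}
```

The serialized form is exactly 12 bytes (4 for the int, 8 for the long), which illustrates the compactness of the format: no field names or type information are written.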
IntWritable, a commonly used implementation of the Writable interface, wraps a Java int. Its source is as follows:
package org.apache.hadoop.io;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
/** A WritableComparable for ints. */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class IntWritable implements WritableComparable<IntWritable> {
private int value;
public IntWritable() {}
public IntWritable(int value) { set(value); }
/** Set the value of this IntWritable. */
public void set(int value) { this.value = value; }
/** Return the value of this IntWritable. */
public int get() { return value; }
@Override
public void readFields(DataInput in) throws IOException {
value = in.readInt();
}
@Override
public void write(DataOutput out) throws IOException {
out.writeInt(value);
}
/** Returns true iff <code>o</code> is a IntWritable with the same value. */
@Override
public boolean equals(Object o) {
if (!(o instanceof IntWritable))
return false;
IntWritable other = (IntWritable)o;
return this.value == other.value;
}
@Override
public int hashCode() {
return value;
}
/** Compares two IntWritables. */
@Override
public int compareTo(IntWritable o) {
int thisValue = this.value;
int thatValue = o.value;
return (thisValue<thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
}
@Override
public String toString() {
return Integer.toString(value);
}
/** A Comparator optimized for IntWritable. */
public static class Comparator extends WritableComparator {
public Comparator() {
super(IntWritable.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
int thisValue = readInt(b1, s1);
int thatValue = readInt(b2, s2);
return (thisValue<thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
}
}
static { // register this comparator
WritableComparator.define(IntWritable.class, new Comparator());
}
}
IntWritable implements the WritableComparable interface, which extends both Writable and java.lang.Comparable:
public interface WritableComparable<T> extends Writable, Comparable<T> {
}
Because MapReduce has an intermediate sort phase based on keys, comparing types is critical. For this, Hadoop provides the RawComparator interface, which allows records to be compared directly in their serialized stream form:
org.apache.hadoop.io.WritableComparator
public int compare(byte[] b1,
int s1,
int l1,
byte[] b2,
int s2,
int l2)
Optimization hook. Override this to make SequenceFile.Sorter's scream.
The default implementation reads the data into two WritableComparables (using Writable.readFields(DataInput)), then calls compare(WritableComparable,WritableComparable).
Specified by:
compare in interface RawComparator
Parameters:
b1 - The first byte array.
s1 - The position index in b1. The object under comparison's starting index.
l1 - The length of the object in b1.
b2 - The second byte array.
s2 - The position index in b2. The object under comparison's starting index.
l2 - The length of the object under comparison in b2.
Returns:
An integer result of the comparison.
WritableComparator is a general-purpose implementation of the RawComparator interface for classes that implement WritableComparable. It provides two main functions. First, it provides a default implementation of the raw compare() method, which deserializes the objects to be compared from the stream and invokes each object's own compare() method. Second, it acts as a factory for RawComparator instances (for registered Writable implementations).
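The performance benefit of comparing raw bytes rather than deserialized objects can be sketched in self-contained form, without Hadoop; readInt() below reimplements the big-endian decoding that DataOutput.writeInt() produces, and the class and method names are ours, not Hadoop's:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RawCompareSketch {

    // Decode a big-endian 4-byte int starting at offset s, the inverse of
    // DataOutput.writeInt(). This mirrors WritableComparator.readInt().
    static int readInt(byte[] b, int s) {
        return ((b[s] & 0xff) << 24) | ((b[s + 1] & 0xff) << 16)
             | ((b[s + 2] & 0xff) << 8) | (b[s + 3] & 0xff);
    }

    // Raw comparison in the spirit of IntWritable.Comparator.compare():
    // no object allocation and no stream construction per comparison.
    static int compareRaw(byte[] b1, int s1, byte[] b2, int s2) {
        int x = readInt(b1, s1);
        int y = readInt(b2, s2);
        return Integer.compare(x, y);
    }

    // Serialize a single int the way DataOutput.writeInt() does.
    static byte[] serialize(int v) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new DataOutputStream(bytes).writeInt(v);
        return bytes.toByteArray();
    }
}
```

This is exactly the optimization IntWritable's registered Comparator performs in its overridden compare(byte[], int, int, byte[], int, int) method shown above.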
2. Writable implementation classes
Figure: the Writable class hierarchy (a simple diagram and a detailed diagram).
Writable wrappers exist for all the Java primitive types except char. Each wrapper has get() and set() methods for retrieving and storing the wrapped value.

Text is a Writable for strings encoded in standard UTF-8. It uses an int (with a variable-length encoding) to store the number of bytes in the string encoding, so its maximum size is 2 GB.

Indexing in Text

Precisely because it uses standard UTF-8, Text differs significantly from java.lang.String: indexing for the Text class is in terms of position in the encoded byte sequence, not the Unicode character in the string, nor the Java char code unit.

BytesWritable is a wrapper for an array of binary data. Its serialized format is an integer (4 bytes) giving the length of the array, followed by the bytes themselves. BytesWritable is mutable; its value may be changed by calling set().

NullWritable is a special type of Writable with a zero-length serialization; it serves only as a placeholder.

ObjectWritable is a general-purpose wrapper for Java types (String, enum, Writable, null, or arrays of these types). It is used in Hadoop RPC to marshal and unmarshal method arguments and return types. ObjectWritable is useful when a field can be of more than one type: for example, if the values in a SequenceFile have multiple types, you can declare the value type as ObjectWritable and wrap each value in an ObjectWritable. As a general-purpose mechanism, however, writing the wrapped type's classname on every serialization wastes space. If the number of types is small and known ahead of time, you can instead use a static array of types and serialize an index into the array as the reference to the type, which improves performance (this is the approach GenericWritable takes).
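The indexing difference between Text and String described above can be illustrated with plain Java, simulating Text's byte-oriented view with a UTF-8 byte array (a sketch; TextIndexingSketch and its method names are our own):

```java
import java.nio.charset.StandardCharsets;

public class TextIndexingSketch {

    // String's view of length: number of Java chars (UTF-16 code units).
    static int charLength(String s) {
        return s.length();
    }

    // Text's view of length: number of bytes in the UTF-8 encoding, which
    // is also what Text's indexing is based on.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String s = "\u4e2d\u6587ab"; // two CJK characters followed by "ab"
        // String sees 4 chars, but the UTF-8 encoding is 3 + 3 + 1 + 1 = 8
        // bytes, so byte offset 3 falls inside the second CJK character
        // rather than on a Latin letter.
        System.out.println(charLength(s)); // 4
        System.out.println(utf8Length(s)); // 8
    }
}
```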
3. Writable collection classes
The org.apache.hadoop.io package contains six Writable collection types: ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable, and EnumSetWritable.
ArrayWritable and TwoDArrayWritable are Writable implementations for arrays and two-dimensional arrays of Writable instances. All the elements of an ArrayWritable or TwoDArrayWritable must be instances of the same class. Both classes have get(), set(), and toArray() methods; toArray() creates a shallow copy of the array.

ArrayPrimitiveWritable is a wrapper for arrays of Java primitives. The component type is detected when set() is called, so there is no need to subclass to set the type.

MapWritable and SortedMapWritable implement java.util.Map<Writable, Writable> and java.util.SortedMap<WritableComparable, Writable>, respectively. The type of each key and value field is part of that field's serialized form, stored as a single byte that acts as an index into an array of types. The array is populated with the standard types from org.apache.hadoop.io, but custom Writable types are accommodated too, by writing a header declaring the class for nonstandard types. Because MapWritable and SortedMapWritable use positive byte values to indicate custom classes, at most 127 distinct nonstandard Writable classes can be used in any one MapWritable or SortedMapWritable instance.
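The single-byte type tag described above can be sketched with a toy tagged map format; the type codes and class below are invented for illustration (Hadoop's actual codes are managed by AbstractMapWritable):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class TypeTagSketch {
    // Hypothetical positive byte codes; a signed byte leaves 127 positive
    // values for custom types, matching the limit noted above.
    static final byte INT_TYPE = 1;
    static final byte TEXT_TYPE = 2;

    static byte[] writeTagged(Map<String, Object> map) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(map.size());
        for (Map.Entry<String, Object> e : map.entrySet()) {
            out.writeUTF(e.getKey());
            Object v = e.getValue();
            if (v instanceof Integer) {   // one-byte tag, then the payload
                out.writeByte(INT_TYPE);
                out.writeInt((Integer) v);
            } else {
                out.writeByte(TEXT_TYPE);
                out.writeUTF(v.toString());
            }
        }
        return bytes.toByteArray();
    }

    static Map<String, Object> readTagged(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int n = in.readInt();
        Map<String, Object> map = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            String key = in.readUTF();
            byte tag = in.readByte();     // dispatch on the type code
            map.put(key, tag == INT_TYPE ? (Object) in.readInt() : in.readUTF());
        }
        return map;
    }
}
```

The tag lets the reader instantiate the right type for each value without any out-of-band schema, which is exactly why heterogeneous values are possible in MapWritable.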
Implementing a custom Writable
Hadoop comes with a useful set of Writable implementations that serve most purposes, but on occasion you may need to build a new implementation tailored to your own needs. With a custom Writable you have full control over the binary representation and the sort order. Because Writables are at the heart of the MapReduce data path, tuning the binary representation can have a significant effect on performance.
Example 1: a Writable implementation for a pair of strings
package com.weiwei.WHadoop.io;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
/**
* @author WangWeiwei
* @version 1.0
* @since 17-3-8
* 一对字符串的序列化操作实现
*/
public class TextPair implements WritableComparable<TextPair>{
private Text first;
private Text second;
public TextPair(){
set(new Text(),new Text());
}
public TextPair(String s1,String s2){
set(new Text(s1),new Text(s2));
}
public Text getFirst() {
return first;
}
public void setFirst(Text first) {
this.first = first;
}
public Text getSecond() {
return second;
}
public void setSecond(Text second) {
this.second = second;
}
public TextPair(Text text, Text text2){
set(text,text2);
}
private void set(Text text, Text text1) {
this.first = text;
this.second = text1;
}
/**
 * Orders pairs by the first field; ties are broken by the second field.
 */
@Override
public int compareTo(TextPair p) {
int cmp = first.compareTo(p.first);
if (cmp != 0){
return cmp;
}
return second.compareTo(p.second);
}
/**
* Serialize the fields of this object to <code>out</code>.
*
* @param out <code>DataOuput</code> to serialize this object into.
* @throws IOException
*/
@Override
public void write(DataOutput out) throws IOException {
first.write(out);
second.write(out);
}
/**
* Deserialize the fields of this object from <code>in</code>.
* <p>
* <p>For efficiency, implementations should attempt to re-use storage in the
* existing object where possible.</p>
*
* @param in <code>DataInput</code> to deseriablize this object from.
* @throws IOException
*/
@Override
public void readFields(DataInput in) throws IOException {
first.readFields(in);
second.readFields(in);
}
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
TextPair textPair = (TextPair) o;
if (first != null ? !first.equals(textPair.first) : textPair.first != null) return false;
return second != null ? second.equals(textPair.second) : textPair.second == null;
}
@Override
public String toString() {
return "TextPair{" +
"first=" + first +
", second=" + second +
'}';
}
@Override
public int hashCode() {
int result = first != null ? first.hashCode() : 0;
result = 31 * result + (second != null ? second.hashCode() : 0);
return result;
}
}
Because MapReduce's default partitioner, HashPartitioner, uses hashCode() to choose a reduce partition, a good hash function is needed to ensure that reduce partitions are of similar size. TextPair implements WritableComparable and provides a compareTo() method that imposes the ordering: pairs are sorted by the first string, and by the second string when the first strings are equal.
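TextPair's two essential properties, in-order field serialization and first-then-second comparison, can be sketched without a Hadoop dependency by substituting String and DataOutput.writeUTF() for Text (StringPair and its helper methods are our own names):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class StringPair implements Comparable<StringPair> {
    String first = "";
    String second = "";

    StringPair() {}
    StringPair(String first, String second) {
        this.first = first;
        this.second = second;
    }

    // Serialize each field in turn, as TextPair.write() delegates to
    // Text.write(); writeUTF() is length-prefixed modified UTF-8.
    void write(DataOutput out) throws IOException {
        out.writeUTF(first);
        out.writeUTF(second);
    }

    void readFields(DataInput in) throws IOException {
        first = in.readUTF();
        second = in.readUTF();
    }

    // Sort on the first field, breaking ties on the second, exactly the
    // ordering TextPair.compareTo() imposes.
    @Override
    public int compareTo(StringPair p) {
        int cmp = first.compareTo(p.first);
        return cmp != 0 ? cmp : second.compareTo(p.second);
    }

    static byte[] serialize(StringPair p) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        p.write(new DataOutputStream(bytes));
        return bytes.toByteArray();
    }

    static StringPair deserialize(byte[] data) throws IOException {
        StringPair p = new StringPair();
        p.readFields(new DataInputStream(new ByteArrayInputStream(data)));
        return p;
    }
}
```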