The common implementation
Serialization
public interface Writable {
  /**
   * Serialize the object to the stream.
   * @param out the DataOutput stream that receives the serialized result
   * @throws IOException
   */
  void write(DataOutput out) throws IOException;
  /**
   * Deserialize the object from the stream.
   * For efficiency, reuse an existing object whenever possible.
   * @param in the DataInput stream to read the data from
   * @throws IOException
   */
  void readFields(DataInput in) throws IOException;
}
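To make the contract concrete, the following IntPairWritable is a hypothetical example type (the class name and fields are not from Hadoop): write() and readFields() are mirror images of each other, and readFields() overwrites the fields of an existing instance, which is exactly what makes object reuse possible:

```java
import java.io.*;

// Hypothetical example type: a pair of ints following the Writable contract.
public class IntPairWritable {
    private int first;
    private int second;

    public IntPairWritable() {}  // no-arg constructor, needed for deserialization
    public IntPairWritable(int first, int second) {
        this.first = first;
        this.second = second;
    }

    // Serialize: write the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    // Deserialize: read the fields back in the same order, overwriting this object.
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    public int getFirst() { return first; }
    public int getSecond() { return second; }

    // Helper: serialize src, then deserialize into a *reused* instance,
    // as the readFields javadoc above recommends.
    public static IntPairWritable roundTrip(IntPairWritable src, IntPairWritable reuse) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            src.write(new DataOutputStream(bytes));
            reuse.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return reuse;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        IntPairWritable p = roundTrip(new IntPairWritable(3, 7), new IntPairWritable());
        System.out.println(p.getFirst() + "," + p.getSecond()); // prints 3,7
    }
}
```

In Hadoop the class would declare `implements Writable`; it is omitted here so the sketch compiles on its own.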
Efficiency matters a great deal in Hadoop, so the Hadoop I/O package provides the RawComparator interface, which supports efficient comparison.
RawComparator lets an implementor compare records read from a stream without first deserializing them into objects, avoiding all object-creation overhead. The two records passed to compare() live in the byte arrays b1 and b2, starting at offsets s1 and s2, with lengths l1 and l2. The code is as follows:
public interface RawComparator<T> extends Comparator<T> {
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}
Writable in detail
The ObjectWritable implementation:
public class ObjectWritable implements Writable, Configurable {
  private Class declaredClass;  // class of the object held by this ObjectWritable
  private Object instance;      // the wrapped object
  private Configuration conf;
  public ObjectWritable() {}
  public ObjectWritable(Object instance) {
    set(instance);
  }
  public ObjectWritable(Class declaredClass, Object instance) {
    this.declaredClass = declaredClass;
    this.instance = instance;
  }
  ……
  public void readFields(DataInput in) throws IOException {
    readObject(in, this, this.conf);
  }
  public void write(DataOutput out) throws IOException {
    writeObject(out, instance, declaredClass, conf);
  }
  ……
  public static void writeObject(DataOutput out, Object instance,
      Class declaredClass, Configuration conf) throws …… {
    if (instance == null) {  // null
      instance = new NullInstance(declaredClass, conf);
      declaredClass = Writable.class;
    }
    // write out the canonical name of declaredClass
    UTF8.writeString(out, declaredClass.getName());
    if (declaredClass.isArray()) {  // array
      ……
    } else if (declaredClass == String.class) {  // String
      ……
    } else if (declaredClass.isPrimitive()) {  // primitive type
      if (declaredClass == Boolean.TYPE) {  // boolean
        out.writeBoolean(((Boolean) instance).booleanValue());
      } else if (declaredClass == Character.TYPE) {  // char
        ……
      }
    } else if (declaredClass.isEnum()) {  // enum type
      ……
    } else if (Writable.class.isAssignableFrom(declaredClass)) {
      // subclass of Writable
      UTF8.writeString(out, instance.getClass().getName());
      ((Writable) instance).write(out);
    } else {
      ……
    }
  }
  public static Object readObject(DataInput in,
      ObjectWritable objectWritable, Configuration conf) throws IOException {
    ……
    Class instanceClass = null;
    ……
    Writable writable = WritableFactories.newInstance(instanceClass, conf);
    writable.readFields(in);
    instance = writable;
    ……
  }
}
Hadoop serialization frameworks
□ Avro is a data serialization system designed for applications that exchange data in bulk. Its main features are: it supports binary serialization, so large volumes of data can be handled conveniently and quickly; and it is dynamic-language friendly, providing mechanisms that let dynamic languages process Avro data easily.
□ Thrift is a scalable, cross-language service development framework contributed to the open-source community by Facebook, where it is one of the core frameworks. The Hadoop filesystem Thrift API (see the thriftfs module in contrib), built on Thrift's cross-platform capability, lets systems written in different languages access HDFS.
□ Google Protocol Buffers is Google's language-neutral internal data standard, offering a lightweight and efficient format for storing structured data. Protocol Buffers currently provides APIs for C++, Java, and Python, and is widely used inside Google for communication protocols, data storage, and similar purposes.
Serialization is an interface that follows the Abstract Factory design pattern, providing the interfaces for a family of related, interdependent serialization objects.
public interface Serialization<T> {
  // used by clients to check whether this serialization supports the given class
  boolean accept(Class<?> c);
  // obtain the Serializer implementation used to serialize objects
  Serializer<T> getSerializer(Class<T> c);
  // obtain the Deserializer implementation used to deserialize objects
  Deserializer<T> getDeserializer(Class<T> c);
}
public interface Serializer<T> {
  // prepare to write (serialize) objects
  void open(OutputStream out) throws IOException;
  // serialize an object to the underlying stream
  void serialize(T t) throws IOException;
  // serialization is done; clean up
  void close() throws IOException;
}
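The open/serialize/close protocol can be sketched with a hypothetical serialization for java.lang.Integer. The classes below mirror the shape of the Serializer and Deserializer interfaces but are illustrative only and do not extend the real Hadoop types:

```java
import java.io.*;

// Hypothetical sketch of the Serializer/Deserializer pattern for Integer.
public class IntSerialization {

    public static class IntSerializer {
        private DataOutputStream out;
        public void open(OutputStream out) { this.out = new DataOutputStream(out); } // prepare
        public void serialize(Integer t) throws IOException { out.writeInt(t); }     // write one value
        public void close() throws IOException { out.close(); }                      // clean up
    }

    public static class IntDeserializer {
        private DataInputStream in;
        public void open(InputStream in) { this.in = new DataInputStream(in); }
        // Integer is immutable, so the reuse argument is ignored here.
        public Integer deserialize(Integer reuse) throws IOException { return in.readInt(); }
        public void close() throws IOException { in.close(); }
    }

    // Round-trip helper: serialize a value, then deserialize it again.
    public static int roundTrip(int value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            IntSerializer ser = new IntSerializer();
            ser.open(bytes);
            ser.serialize(value);
            ser.close();
            IntDeserializer de = new IntDeserializer();
            de.open(new ByteArrayInputStream(bytes.toByteArray()));
            int result = de.deserialize(null);
            de.close();
            return result;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(12345)); // prints 12345
    }
}
```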
Hadoop compression
public interface CompressionCodec {
  // create a CompressionOutputStream for this codec's algorithm on top of the underlying output stream out
  CompressionOutputStream createOutputStream(OutputStream out) ……
  // create the corresponding compression stream on top of out, using the given compressor
  CompressionOutputStream createOutputStream(OutputStream out,
      Compressor compressor) ……
  ……
  // create a compressor for this codec's algorithm
  Compressor createCompressor();
  // create a CompressionInputStream for this codec's algorithm on top of the underlying input stream in
  CompressionInputStream createInputStream(InputStream in) ……
  ……
  // get the default file extension for this codec's algorithm
  String getDefaultExtension();
}
- Compressors and decompressors
Compression streams are implemented by subclasses of CompressionOutputStream. The relevant code is as follows:
public abstract class CompressionOutputStream extends OutputStream {
  // the stream that receives the compressed output
  protected final OutputStream out;
  // constructor
  protected CompressionOutputStream(OutputStream out) {
    this.out = out;
  }
  public void close() throws IOException {
    finish();
    out.close();
  }
  public void flush() throws IOException {
    out.flush();
  }
  public abstract void write(byte[] b, int off, int len) throws IOException;
  public abstract void finish() throws IOException;
  public abstract void resetState() throws IOException;
}
CompressorStream uses a compressor to implement a general-purpose compression stream. Its main code is as follows:
public class CompressorStream extends CompressionOutputStream {
  protected Compressor compressor;
  protected byte[] buffer;
  protected boolean closed = false;
  // constructor
  public CompressorStream(OutputStream out,
      Compressor compressor, int bufferSize) {
    super(out);
    ……  // parameter checks, omitted
    this.compressor = compressor;
    buffer = new byte[bufferSize];
  }
  ……
  public void write(byte[] b, int off, int len) throws IOException {
    // parameter checks, omitted
    ……
    compressor.setInput(b, off, len);
    while (!compressor.needsInput()) {
      compress();
    }
  }
  protected void compress() throws IOException {
    int len = compressor.compress(buffer, 0, buffer.length);
    if (len > 0) {
      out.write(buffer, 0, len);
    }
  }
  // end the input
  public void finish() throws IOException {
    if (!compressor.finished()) {
      compressor.finish();
      while (!compressor.finished()) {
        compress();
      }
    }
  }
  ……
  // close the stream
  public void close() throws IOException {
    if (!closed) {
      finish();     // finish compressing
      out.close();  // close the underlying stream
      closed = true;
    }
  }
  ……
}
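The same setInput/needsInput/finish/finished protocol that Compressor follows is also exposed by java.util.zip.Deflater from the JDK, so the control flow of write(), compress(), and finish() above can be sketched against it. CompressLoopDemo below is an illustrative stand-in, not Hadoop code:

```java
import java.util.zip.*;
import java.util.Arrays;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Demonstrates the CompressorStream control flow using java.util.zip.Deflater
// as a stand-in for Hadoop's Compressor interface.
public class CompressLoopDemo {

    // Mirror of CompressorStream.write() + finish(): feed input,
    // then drain the compressor until it reports finished().
    public static byte[] compress(byte[] data) {
        Deflater compressor = new Deflater();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[64];

        compressor.setInput(data);           // like compressor.setInput(b, off, len)
        while (!compressor.needsInput()) {   // drain while input remains
            int len = compressor.deflate(buffer);  // like compress() above
            if (len > 0) out.write(buffer, 0, len);
        }
        compressor.finish();                 // no more input will arrive
        while (!compressor.finished()) {     // flush the remaining compressed data
            int len = compressor.deflate(buffer);
            if (len > 0) out.write(buffer, 0, len);
        }
        compressor.end();
        return out.toByteArray();
    }

    // Inverse operation, used to check the round trip.
    public static byte[] decompress(byte[] data) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[64];
        while (!inflater.finished()) {
            int len = inflater.inflate(buffer);
            if (len > 0) out.write(buffer, 0, len);
        }
        inflater.end();
        return out.toByteArray();
    }

    public static boolean roundTrips(String text) {
        try {
            byte[] raw = text.getBytes(StandardCharsets.UTF_8);
            return Arrays.equals(raw, decompress(compress(raw)));
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrips("hello hello hello"));
    }
}
```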
- Java native methods
Data compression is often a compute-intensive operation, so for performance it is advisable to use a native library for compression and decompression. In one benchmark, compared with Java's built-in gzip implementation, the native gzip library cut decompression time by about 50% and compression time by roughly 10%.
public class SnappyCompressor implements Compressor {
……
private native static void initIDs();
private native int compressBytesDirect();
}
Implementing these two native methods generally takes three steps:
1) generate a C stub that converts between the Java call and the actual C function;
2) build a shared library and export the stub;
3) call System.loadLibrary() to tell the Java runtime to load the shared library.
The JDK provides the utility javah for generating C stubs. Taking the SnappyCompressor above as an example, run the following command in the build/classes directory:
javah org.apache.hadoop.io.compress.snappy.SnappyCompressor
This generates a header file named org_apache_hadoop_io_compress_snappy_SnappyCompressor.h.
/* DO NOT EDIT THIS FILE - it is machine generated */
#include <jni.h>
/* Header for class org_apache_hadoop_io_compress_snappy_SnappyCompressor */
#ifndef _Included_org_apache_hadoop_io_compress_snappy_SnappyCompressor
#define _Included_org_apache_hadoop_io_compress_snappy_SnappyCompressor
#ifdef __cplusplus
extern "C" {
#endif
#undef org_apache_hadoop_io_compress_snappy_SnappyCompressor_DEFAULT_DIRECT_BUFFER_SIZE
#define org_apache_hadoop_io_compress_snappy_SnappyCompressor_DEFAULT_DIRECT_BUFFER_SIZE 65536L
/*
 * Class:     org_apache_hadoop_io_compress_snappy_SnappyCompressor
 * Method:    initIDs
 * Signature: ()V
 */
JNIEXPORT void JNICALL Java_org_apache_hadoop_io_compress_snappy_SnappyCompressor_initIDs
  (JNIEnv *, jclass);
/*
 * Class:     org_apache_hadoop_io_compress_snappy_SnappyCompressor
 * Method:    compressBytesDirect
 * Signature: ()I
 */
JNIEXPORT jint JNICALL Java_org_apache_hadoop_io_compress_snappy_SnappyCompressor_compressBytesDirect
  (JNIEnv *, jobject);
#ifdef __cplusplus
}
#endif
#endif
The C source for the compression side is SnappyCompressor.c. Here we only cover the implementation of Java_…_compressBytesDirect, whose code is as follows:
……
static jfieldID SnappyCompressor_clazz;
……
static jfieldID SnappyCompressor_directBufferSize;
……
static snappy_status (*dlsym_snappy_compress)(const char *, size_t, char *, size_t *);
……
JNIEXPORT jint JNICALL Java_org_apache_hadoop_io_compress_snappy_SnappyCompressor_compressBytesDirect
(JNIEnv *env, jobject thisj)
{
  // fetch the relevant member variables of the SnappyCompressor object
  jobject clazz = (*env)->GetStaticObjectField
      (env, thisj, SnappyCompressor_clazz);
  jobject uncompressed_direct_buf = (*env)->GetObjectField
      (env, thisj, SnappyCompressor_uncompressedDirectBuf);
  jint uncompressed_direct_buf_len = (*env)->GetIntField
      (env, thisj, SnappyCompressor_uncompressedDirectBufLen);
  jobject compressed_direct_buf = (*env)->GetObjectField
      (env, thisj, SnappyCompressor_compressedDirectBuf);
  jint compressed_direct_buf_len = (*env)->GetIntField
      (env, thisj, SnappyCompressor_directBufferSize);
  // get the buffer holding the uncompressed data
  LOCK_CLASS(env, clazz, "SnappyCompressor");
  const char *uncompressed_bytes = (const char *)
      (*env)->GetDirectBufferAddress(env, uncompressed_direct_buf);
  UNLOCK_CLASS(env, clazz, "SnappyCompressor");
  if (uncompressed_bytes == 0) {
    return (jint)0;
  }
  // get the buffer that will receive the compressed result
  ……
  // compress the data
  snappy_status ret = dlsym_snappy_compress(uncompressed_bytes,
                                            uncompressed_direct_buf_len,
                                            compressed_bytes,
                                            &compressed_direct_buf_len);
  // handle the return value
  if (ret != SNAPPY_OK) {
    THROW(env, "Ljava/lang/InternalError",
          "Could not compress data. Buffer length is too small.");
  }
  (*env)->SetIntField
      (env, thisj, SnappyCompressor_uncompressedDirectBufLen, 0);
  return (jint)compressed_direct_buf_len;
}
JNIEnv provides the environment through which C code communicates with the Java virtual machine. While Java_…_compressBytesDirect executes, it needs several member variables of the SnappyCompressor object, and it obtains them through methods provided by JNIEnv.
GetObjectField() retrieves an object-typed field. In the code above it is used to fetch the buffers holding the data to be compressed and receiving the compressed output, that is, the SnappyCompressor member variables uncompressedDirectBuf and compressedDirectBuf. The JNIEnv method GetDirectBufferAddress() then yields the address of each buffer, so the C code can access the buffered data directly. JNIEnv also provides GetIntField() to read an integer member variable of a Java object, and its counterpart SetIntField(), which sets the value of such a variable.
A Java application must explicitly tell the Java runtime to load the relevant dynamic library (here, the Snappy library), which can be done with code like the following (for details, see the LoadSnappy class):
public class LoadSnappy {
  static {
    try {
      System.loadLibrary("snappy");
      LOG.warn("Snappy native library is available");
      AVAILABLE = true;
    } catch (UnsatisfiedLinkError ex) {
      // NOP
    }
    ……
  }
  ……
}