Serialization
Serialization is the process of turning structured objects into a byte stream for transmission over a network or for storage on disk. Deserialization is the reverse process: turning a byte stream back into a series of structured objects.
Desirable properties of an RPC serialization format
1. Compact: makes efficient use of network bandwidth and storage space.
2. Fast: serialization and deserialization should perform well.
3. Extensible: the protocol can evolve to meet new requirements.
4. Interoperable: clients and servers are not tied to a particular language implementation.
Hadoop uses Writables, which are compact and fast, but not extensible or interoperable.
The Writable interface
package org.apache.hadoop.io;
import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;
public interface Writable {
    // Serialize this object's fields to out.
    void write(DataOutput out) throws IOException;
    // Deserialize this object's fields from in.
    void readFields(DataInput in) throws IOException;
}
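To define a custom type, a class implements both methods, writing its fields in a fixed order and reading them back in the same order. Below is a minimal sketch (the WeatherWritable name and its two int fields are illustrative, not part of Hadoop):
package com.bigdata.io;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
public class WeatherWritable implements Writable {
    private int year;
    private int temperature;
    // Writables need a no-arg constructor so the framework can instantiate them reflectively.
    public WeatherWritable() {}
    public WeatherWritable(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }
    public int getYear() { return year; }
    public int getTemperature() { return temperature; }
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);         // fields are written in a fixed order...
        out.writeInt(temperature);
    }
    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();        // ...and must be read back in exactly the same order
        temperature = in.readInt();
    }
}
The helper class below serializes and deserializes any Writable through in-memory byte streams: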
package com.bigdata.io;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
public final class WritableHelper {
    // Serialize a Writable into a byte array via a DataOutputStream.
    public static byte[] serialize(Writable writable) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);
        dataOut.close();
        return out.toByteArray();
    }
    // Populate a Writable from a byte array; returns the input bytes unchanged.
    public static byte[] deserialize(Writable writable, byte[] bytes) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        DataInputStream dataIn = new DataInputStream(in);
        writable.readFields(dataIn);
        dataIn.close();
        return bytes;
    }
    public static void main(String[] args) throws IOException {
        IntWritable writable = new IntWritable();
        writable.set(163);
        byte[] bytes = serialize(writable);
        System.out.println(bytes.length + "," + Bytes.toInt(bytes)); // prints "4,163": an int is 4 bytes
        deserialize(writable, bytes);
        System.out.println(bytes.length + "," + Bytes.toInt(bytes)); // prints "4,163" again
    }
}
WritableComparable and comparators
package org.apache.hadoop.io;
public interface WritableComparable<T> extends Writable, Comparable<T> {
}
package org.apache.hadoop.io;
import java.util.Comparator;
// Compares records directly from their serialized representations,
// without deserializing them into objects first.
public interface RawComparator<T> extends Comparator<T> {
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}
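A raw comparator makes sorting cheap because records are ordered straight from their bytes. A minimal sketch for the hypothetical WeatherWritable above, assuming records should be ordered by year, which occupies the first four serialized bytes:
package com.bigdata.io;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.WritableComparator;
public class WeatherRawComparator implements RawComparator<WeatherWritable> {
    // Compare the serialized year fields directly; WritableComparator.readInt
    // reads a big-endian int at the given offset, so no objects are created.
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        int year1 = WritableComparator.readInt(b1, s1);
        int year2 = WritableComparator.readInt(b2, s2);
        return Integer.compare(year1, year2);
    }
    // The object-based comparison applies the same ordering.
    @Override
    public int compare(WeatherWritable w1, WeatherWritable w2) {
        return Integer.compare(w1.getYear(), w2.getYear());
    }
}
The extended WritableHelper below compares IntWritables both as objects and as raw bytes: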
package com.bigdata.io;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparator;
public final class WritableHelper {
    // Serialize a Writable into a byte array via a DataOutputStream.
    public static byte[] serialize(Writable writable) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);
        dataOut.close();
        return out.toByteArray();
    }
    // Populate a Writable from a byte array; returns the input bytes unchanged.
    public static byte[] deserialize(Writable writable, byte[] bytes) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        DataInputStream dataIn = new DataInputStream(in);
        writable.readFields(dataIn);
        dataIn.close();
        return bytes;
    }
    public static void main(String[] args) throws IOException {
        IntWritable writable = new IntWritable();
        writable.set(163);
        byte[] bytes = serialize(writable);
        System.out.println(bytes.length + "," + Bytes.toInt(bytes)); // prints "4,163"
        deserialize(writable, bytes);
        System.out.println(bytes.length + "," + Bytes.toInt(bytes)); // prints "4,163" again
        // Compare two IntWritables, first as objects, then as raw bytes.
        RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);
        IntWritable w1 = new IntWritable(163);
        IntWritable w2 = new IntWritable(67);
        int result = comparator.compare(w1, w2);
        System.out.println(result); // 1, since 163 > 67
        byte[] b1 = serialize(w1);
        byte[] b2 = serialize(w2);
        result = comparator.compare(b1, 0, b1.length, b2, 0, b2.length);
        System.out.println(result); // 1 again, computed without deserializing
    }
}
Java primitive | Writable implementation | Serialized size (bytes) |
boolean | BooleanWritable | 1 |
byte | ByteWritable | 1 |
short | ShortWritable | 2 |
int | IntWritable | 4 |
int | VIntWritable | 1-5 |
float | FloatWritable | 4 |
long | LongWritable | 8 |
long | VLongWritable | 1-9 |
double | DoubleWritable | 8 |
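The variable-length types pay off when most values are small. A quick check of the serialized sizes, reusing the WritableHelper.serialize method above (VIntWritable stores values between -112 and 127 in a single byte; larger values need a length byte plus the value bytes):
package com.bigdata.io;
import java.io.IOException;
import org.apache.hadoop.io.VIntWritable;
public class VIntSizeDemo {
    public static void main(String[] args) throws IOException {
        System.out.println(WritableHelper.serialize(new VIntWritable(127)).length);               // 1
        System.out.println(WritableHelper.serialize(new VIntWritable(163)).length);               // 2
        System.out.println(WritableHelper.serialize(new VIntWritable(Integer.MAX_VALUE)).length); // 5
    }
}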
Avro
Apache Avro is a language-neutral data serialization system that addresses the two properties Writables lack: extensibility and interoperability. Avro defines a small set of primitive types:
Type | Description | Schema |
null | The absence of a value | "null" |
boolean | A binary value | "boolean" |
int | 32-bit signed integer | "int" |
long | 64-bit signed integer | "long" |
float | Single precision (32-bit) IEEE 754 floating-point number | "float" |
double | Double precision (64-bit) IEEE 754 floating-point number | "double" |
bytes | Sequence of 8-bit unsigned bytes | "bytes" |
string | Sequence of Unicode characters | "string" |
Type | Description | Schema example |
array | An ordered collection of objects. All objects in a particular array must have the same schema. | { "type":"array", "items":"long" } |
map | An unordered collection of key-value pairs. Keys must be strings, values may be any type, although within a particular map all values must have the same schema. | { "type":"map", "values":"string" } |
record | A collection of named fields of any type. | { "type":"record", "name":"WeatherRecord", "doc":"A weather reading.", "fields":[ {"name":"year","type":"int"}, {"name":"temperature","type":"int"}, {"name":"stationId","type":"string"} ] } |
enum | A set of named values. | { "type":"enum", "name":"Cutlery", "doc":"An eating utensil.", "symbols":["KNIFE","FORK","SPOON"] } |
fixed | A fixed number of 8-bit unsigned bytes. | { "type":"fixed", "name":"Md5Hash", "size":16 } |
union | A union of schemas. A union is represented by a JSON array, where each element in the array is a schema. Data represented by a union must match one of the schemas in the union. | [ "null", "string", {"type":"map","values":"string"} ] |
The following example serializes a record with Avro's generic API and reads it back:
package com.bigdata.io.avro;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;
public class StringPair {
    public static void main(String[] args) throws IOException {
        // Load the schema from the classpath.
        Schema.Parser parser = new Schema.Parser();
        Schema schema = parser.parse(StringPair.class.getResourceAsStream("/StringPair.avsc"));
        // Create an instance of an Avro record using the generic API.
        GenericRecord datum = new GenericData.Record(schema);
        datum.put("left", "L");
        datum.put("right", "R");
        // Serialize the record to an output stream.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
        Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(datum, encoder);
        encoder.flush();
        out.close();
        // Read the record back from the bytes.
        DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
        Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord result = reader.read(null, decoder);
        String r1 = result.get("left").toString();
        String r2 = result.get("right").toString();
        System.out.println(r1 + "," + r2); // prints "L,R"
    }
}
The schema file StringPair.avsc, placed on the classpath:
{
    "type": "record",
    "name": "StringPair",
    "doc": "A pair of strings.",
    "fields": [
        {"name": "left", "type": "string"},
        {"name": "right", "type": "string"}
    ]
}
package com.bigdata.io.avro;
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
public class AvroWriteToFile {
    public static void main(String[] args) throws IOException {
        Schema.Parser parser = new Schema.Parser();
        Schema schema = parser.parse(AvroWriteToFile.class.getResourceAsStream("/StringPair.avsc"));
        // Write two records to an Avro data file; the schema is stored in the file's metadata.
        GenericRecord datum = new GenericData.Record(schema);
        datum.put("left", "L");
        datum.put("right", "R");
        File file = new File("data.avro");
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
        DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(writer);
        dataFileWriter.create(schema, file);
        dataFileWriter.append(datum);
        datum.put("left", "is left");
        datum.put("right", "is right");
        dataFileWriter.append(datum);
        dataFileWriter.close();
        // Read the records back, first with the for-each idiom...
        DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
        DataFileReader<GenericRecord> fileReader = new DataFileReader<GenericRecord>(file, reader);
        for (GenericRecord record : fileReader) {
            System.out.println(record.get("left") + "," + record.get("right"));
        }
        // ...then rewind to the first sync point and iterate again explicitly.
        fileReader.sync(0);
        System.out.println(fileReader.getBlockCount());
        GenericRecord result = null;
        while (fileReader.hasNext()) {
            result = fileReader.next();
            System.out.println(result.get("left") + "," + result.get("right"));
        }
        fileReader.close();
    }
}
Schema evolution: add a description field to StringPair.avsc. A default value must be provided, so that data written with the original schema can still be read with the new one, while the new schema can also be used going forward.
{
    "type": "record",
    "name": "StringPair",
    "doc": "A pair of strings.",
    "fields": [
        {"name": "left", "type": "string"},
        {"name": "right", "type": "string"},
        {"name": "description", "type": ["null", "string"], "default": null}
    ]
}
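A sketch of the two schemas working together, assuming the new version is saved as /StringPairV2.avsc (a hypothetical file name). Avro resolves the writer's schema against the reader's schema at read time and fills in description from its default:
package com.bigdata.io.avro;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;
public class SchemaEvolutionDemo {
    public static void main(String[] args) throws IOException {
        // Two separate parsers: a single parser would reject a second record named StringPair.
        Schema writerSchema = new Schema.Parser()
                .parse(SchemaEvolutionDemo.class.getResourceAsStream("/StringPair.avsc"));
        Schema readerSchema = new Schema.Parser()
                .parse(SchemaEvolutionDemo.class.getResourceAsStream("/StringPairV2.avsc"));
        // Serialize a record using the original two-field schema.
        GenericRecord datum = new GenericData.Record(writerSchema);
        datum.put("left", "L");
        datum.put("right", "R");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(writerSchema);
        Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(datum, encoder);
        encoder.flush();
        // Deserialize with both schemas; the missing field is filled in from its default.
        DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(writerSchema, readerSchema);
        Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord result = reader.read(null, decoder);
        System.out.println(result.get("left") + "," + result.get("right") + "," + result.get("description")); // prints "L,R,null"
    }
}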