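Preface
Avro began as a subproject of Hadoop, led by Hadoop's creator Doug Cutting (who also created Lucene, Nutch, and other projects; hats off).
Avro is a data serialization system designed for applications that exchange large volumes of data.
Its main features: it supports binary serialization, so large amounts of data can be processed conveniently and quickly; and it is friendly to dynamic languages, providing mechanisms that let them handle Avro data easily.
Other serialization systems:
- Google's Protocol Buffers
- Facebook's Thrift
- Hessian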
Avro references:
- http://avro.apache.org/docs/current/spec.html (official documentation)
- https://blog.cloudera.com/blog/2009/11/avro-a-new-format-for-data-interchange/ (an article by an expert)
- http://www.trieuvan.com/ (mirror for downloading avro-tools.jar)
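- Avro provides an RPC mechanism, and you can use Avro for data storage and RPC without generating any extra API code: code generation is optional, which sets it apart from protobuf and Thrift. In addition, several projects on the Hadoop platform use (or support) Avro as their data serialization service.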
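- Although Avro provides RPC, its core features mean it is usually used in "big data" storage scenarios (MapReduce): data is written with a schema to a local file or to HDFS, and a reader then iterates over the entries according to the schema. The schema can be changed and adjusted within limits, and Avro handles the differences gracefully; this kind of extensibility is exactly what file-based data storage needs.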
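- Avro is schema-based, just like protobuf and Thrift. Data types are declared in a schema file (.avsc), and a protocol (the RPC interface) can be declared in the same JSON style, conventionally in an .avpr file; Avro then serializes data according to the schema on read and write. Because the schema exists, reads and writes can be done from different languages. Avro schemas are JSON, so they are simple to write and easy to read. Avro does not yet support very many languages; those supported include Java, C++, and Python.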
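- When Avro writes data to a file, it stores the schema together with the data, so a reader can later process the data based on that schema. If the reader uses a different schema, Avro provides compatibility mechanisms to resolve the differences.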
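To make the compatibility mechanism a little more concrete: for records, the Avro specification matches reader and writer fields by name, ignores writer fields the reader does not declare, and fills reader fields the writer never wrote from the reader's default value (failing if there is none). The sketch below models only these three rules with plain maps. It is not Avro's implementation; the names ResolutionSketch and resolve, and the "email" field, are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

final class ResolutionSketch {

    /**
     * written:      field name -> value, as produced with the writer's schema
     * readerFields: reader field name -> declared default (empty = no default)
     */
    static Map<String, Object> resolve(Map<String, Object> written,
                                       Map<String, Optional<Object>> readerFields) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Optional<Object>> field : readerFields.entrySet()) {
            String name = field.getKey();
            if (written.containsKey(name)) {
                result.put(name, written.get(name));       // matched by name
            } else if (field.getValue().isPresent()) {
                result.put(name, field.getValue().get());  // reader's default
            } else {
                throw new IllegalStateException(
                        "No value and no default for field: " + name);
            }
        }
        return result; // writer-only fields were never copied, i.e. ignored
    }

    public static void main(String[] args) {
        Map<String, Object> written = new LinkedHashMap<>();
        written.put("name", "Danny");
        written.put("favorite_number", 119);
        written.put("favorite_color", "black");

        // Hypothetical newer reader: drops favorite_number, adds email with a default.
        Map<String, Optional<Object>> readerFields = new LinkedHashMap<>();
        readerFields.put("name", Optional.<Object>empty());
        readerFields.put("favorite_color", Optional.<Object>empty());
        readerFields.put("email", Optional.<Object>of("unknown"));

        // prints {name=Danny, favorite_color=black, email=unknown}
        System.out.println(resolve(written, readerFields));
    }
}
```

With the real library the same idea is expressed by giving the reader both the writer's schema and its own schema; the sketch only shows the matching rules.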
When Avro is used for RPC, the client and server first exchange schemas through a handshake before transferring data, and agree on the schemas to use.
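The handshake can be pictured roughly as follows. In the Avro specification each side identifies a protocol by the MD5 hash of its JSON text, and the server answers with a match value of BOTH, CLIENT, or NONE. The sketch below is a simplified, library-free model of that decision only: real Avro exchanges Avro-encoded request and response records over the transport, which is skipped here, and the names HandshakeSketch and Server and the method signatures are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

final class HandshakeSketch {

    enum Match { BOTH, CLIENT, NONE }

    /** Hex MD5 of a protocol's JSON text, used here as its identifier. */
    static String hash(String protocolJson) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(protocolJson.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 ships with every Java platform
        }
    }

    static final class Server {
        private final String protocolJson;
        private final Map<String, String> knownClients = new HashMap<>(); // hash -> protocol

        Server(String protocolJson) {
            this.protocolJson = protocolJson;
        }

        /** clientProtocol may be null when the client sends only hashes. */
        Match handshake(String clientHash, String clientProtocol, String serverHashGuess) {
            if (clientProtocol != null) {
                knownClients.put(clientHash, clientProtocol);
            }
            if (!knownClients.containsKey(clientHash)) {
                return Match.NONE;   // client must retry and include its full protocol
            }
            return hash(protocolJson).equals(serverHashGuess)
                    ? Match.BOTH     // each side already knows the other's protocol
                    : Match.CLIENT;  // server would send its own protocol back
        }
    }

    public static void main(String[] args) {
        String serverProtocol = "{\"protocol\":\"Server\"}";
        String clientProtocol = "{\"protocol\":\"Client\"}";
        Server server = new Server(serverProtocol);
        String clientHash = hash(clientProtocol);

        System.out.println(server.handshake(clientHash, null, "stale"));               // NONE
        System.out.println(server.handshake(clientHash, clientProtocol, "stale"));     // CLIENT
        System.out.println(server.handshake(clientHash, null, hash(serverProtocol)));  // BOTH
    }
}
```

Once the result is BOTH, later calls can skip sending the full protocol text, which is the point of exchanging hashes first.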
Java example
First, create the user.avsc file
user.avsc
{
  "namespace": "com.beng.rpc.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
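The schema above says that the User record has three fields: "name", "favorite_number", and "favorite_color". "type" declares a field's data type; for example, the type of "favorite_color" is ["string", "null"], which means the value can be either a string or null.
Compile it with avro-tools.jar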
java -jar avro-tools-1.7.7.jar compile schema user.avsc .
The generated files are written to the current directory.
You can also compile it with Maven in Eclipse; the pom dependencies:
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro</artifactId>
  <version>1.7.7</version>
</dependency>
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-tools</artifactId>
  <version>1.7.7</version>
</dependency>
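avro and avro-tools are the basic packages used here for Avro development: avro is the core library, while avro-tools mainly bundles the command-line utilities, so strictly speaking only avro is needed to compile and run the code below. If you want Maven to generate the Java code from the .avsc file, you also need to add the avro-maven-plugin shown below; otherwise it is not required.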
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.7.7</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/resources/avro/</sourceDirectory>
        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <configuration>
    <encoding>utf-8</encoding>
  </configuration>
</plugin>
Let's look at the generated User.java:
/**
* Autogenerated by Avro
*
* DO NOT EDIT DIRECTLY
*/
package com.beng.rpc.avro;
@SuppressWarnings("all")
@org.apache.avro.specific.AvroGenerated
public class User extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord {
public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse("{\"type\":\"record\",\"name\":\"User\",\"namespace\":\"com.beng.rpc.avro\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"favorite_number\",\"type\":[\"int\",\"null\"]},{\"name\":\"favorite_color\",\"type\":[\"string\",\"null\"]}]}");
public static org.apache.avro.Schema getClassSchema() { return SCHEMA$; }
@Deprecated public java.lang.CharSequence name;
@Deprecated public java.lang.Integer favorite_number;
@Deprecated public java.lang.CharSequence favorite_color;
/**
* Default constructor. Note that this does not initialize fields
* to their default values from the schema. If that is desired then
* one should use <code>newBuilder()</code>.
*/
public User() {}
/**
* All-args constructor.
*/
public User(java.lang.CharSequence name, java.lang.Integer favorite_number, java.lang.CharSequence favorite_color) {
this.name = name;
this.favorite_number = favorite_number;
this.favorite_color = favorite_color;
}
public org.apache.avro.Schema getSchema() { return SCHEMA$; }
// Used by DatumWriter. Applications should not call.
public java.lang.Object get(int field$) {
switch (field$) {
case 0: return name;
case 1: return favorite_number;
case 2: return favorite_color;
default: throw new org.apache.avro.AvroRuntimeException("Bad index");
}
}
// Used by DatumReader. Applications should not call.
@SuppressWarnings(value="unchecked")
public void put(int field$, java.lang.Object value$) {
switch (field$) {
case 0: name = (java.lang.CharSequence)value$; break;
case 1: favorite_number = (java.lang.Integer)value$; break;
case 2: favorite_color = (java.lang.CharSequence)value$; break;
default: throw new org.apache.avro.AvroRuntimeException("Bad index");
}
}
/**
* Gets the value of the 'name' field.
*/
public java.lang.CharSequence getName() {
return name;
}
/**
* Sets the value of the 'name' field.
* @param value the value to set.
*/
public void setName(java.lang.CharSequence value) {
this.name = value;
}
/**
* Gets the value of the 'favorite_number' field.
*/
public java.lang.Integer getFavoriteNumber() {
return favorite_number;
}
/**
* Sets the value of the 'favorite_number' field.
* @param value the value to set.
*/
public void setFavoriteNumber(java.lang.Integer value) {
this.favorite_number = value;
}
/**
* Gets the value of the 'favorite_color' field.
*/
public java.lang.CharSequence getFavoriteColor() {
return favorite_color;
}
/**
* Sets the value of the 'favorite_color' field.
* @param value the value to set.
*/
public void setFavoriteColor(java.lang.CharSequence value) {
this.favorite_color = value;
}
/** Creates a new User RecordBuilder */
public static com.beng.rpc.avro.User.Builder newBuilder() {
return new com.beng.rpc.avro.User.Builder();
}
/** Creates a new User RecordBuilder by copying an existing Builder */
public static com.beng.rpc.avro.User.Builder newBuilder(com.beng.rpc.avro.User.Builder other) {
return new com.beng.rpc.avro.User.Builder(other);
}
/** Creates a new User RecordBuilder by copying an existing User instance */
public static com.beng.rpc.avro.User.Builder newBuilder(com.beng.rpc.avro.User other) {
return new com.beng.rpc.avro.User.Builder(other);
}
/**
* RecordBuilder for User instances.
*/
public static class Builder extends org.apache.avro.specific.SpecificRecordBuilderBase<User>
implements org.apache.avro.data.RecordBuilder<User> {
private java.lang.CharSequence name;
private java.lang.Integer favorite_number;
private java.lang.CharSequence favorite_color;
/** Creates a new Builder */
private Builder() {
super(com.beng.rpc.avro.User.SCHEMA$);
}
/** Creates a Builder by copying an existing Builder */
private Builder(com.beng.rpc.avro.User.Builder other) {
super(other);
if (isValidValue(fields()[0], other.name)) {
this.name = data().deepCopy(fields()[0].schema(), other.name);
fieldSetFlags()[0] = true;
}
if (isValidValue(fields()[1], other.favorite_number)) {
this.favorite_number = data().deepCopy(fields()[1].schema(), other.favorite_number);
fieldSetFlags()[1] = true;
}
if (isValidValue(fields()[2], other.favorite_color)) {
this.favorite_color = data().deepCopy(fields()[2].schema(), other.favorite_color);
fieldSetFlags()[2] = true;
}
}
/** Creates a Builder by copying an existing User instance */
private Builder(com.beng.rpc.avro.User other) {
super(com.beng.rpc.avro.User.SCHEMA$);
if (isValidValue(fields()[0], other.name)) {
this.name = data().deepCopy(fields()[0].schema(), other.name);
fieldSetFlags()[0] = true;
}
if (isValidValue(fields()[1], other.favorite_number)) {
this.favorite_number = data().deepCopy(fields()[1].schema(), other.favorite_number);
fieldSetFlags()[1] = true;
}
if (isValidValue(fields()[2], other.favorite_color)) {
this.favorite_color = data().deepCopy(fields()[2].schema(), other.favorite_color);
fieldSetFlags()[2] = true;
}
}
/** Gets the value of the 'name' field */
public java.lang.CharSequence getName() {
return name;
}
/** Sets the value of the 'name' field */
public com.beng.rpc.avro.User.Builder setName(java.lang.CharSequence value) {
validate(fields()[0], value);
this.name = value;
fieldSetFlags()[0] = true;
return this;
}
/** Checks whether the 'name' field has been set */
public boolean hasName() {
return fieldSetFlags()[0];
}
/** Clears the value of the 'name' field */
public com.beng.rpc.avro.User.Builder clearName() {
name = null;
fieldSetFlags()[0] = false;
return this;
}
/** Gets the value of the 'favorite_number' field */
public java.lang.Integer getFavoriteNumber() {
return favorite_number;
}
/** Sets the value of the 'favorite_number' field */
public com.beng.rpc.avro.User.Builder setFavoriteNumber(java.lang.Integer value) {
validate(fields()[1], value);
this.favorite_number = value;
fieldSetFlags()[1] = true;
return this;
}
/** Checks whether the 'favorite_number' field has been set */
public boolean hasFavoriteNumber() {
return fieldSetFlags()[1];
}
/** Clears the value of the 'favorite_number' field */
public com.beng.rpc.avro.User.Builder clearFavoriteNumber() {
favorite_number = null;
fieldSetFlags()[1] = false;
return this;
}
/** Gets the value of the 'favorite_color' field */
public java.lang.CharSequence getFavoriteColor() {
return favorite_color;
}
/** Sets the value of the 'favorite_color' field */
public com.beng.rpc.avro.User.Builder setFavoriteColor(java.lang.CharSequence value) {
validate(fields()[2], value);
this.favorite_color = value;
fieldSetFlags()[2] = true;
return this;
}
/** Checks whether the 'favorite_color' field has been set */
public boolean hasFavoriteColor() {
return fieldSetFlags()[2];
}
/** Clears the value of the 'favorite_color' field */
public com.beng.rpc.avro.User.Builder clearFavoriteColor() {
favorite_color = null;
fieldSetFlags()[2] = false;
return this;
}
@Override
public User build() {
try {
User record = new User();
record.name = fieldSetFlags()[0] ? this.name : (java.lang.CharSequence) defaultValue(fields()[0]);
record.favorite_number = fieldSetFlags()[1] ? this.favorite_number : (java.lang.Integer) defaultValue(fields()[1]);
record.favorite_color = fieldSetFlags()[2] ? this.favorite_color : (java.lang.CharSequence) defaultValue(fields()[2]);
return record;
} catch (Exception e) {
throw new org.apache.avro.AvroRuntimeException(e);
}
}
}
}
Using User.java
import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;

import com.beng.rpc.avro.User;

public class Main {
    public static void main(String[] args) throws IOException {
        // Three ways to create an object
        // 1. Setters
        User user = new User();
        user.setName("Janny");
        user.setFavoriteColor("red");
        user.setFavoriteNumber(110);
        // 2. Constructor
        User user1 = new User("Danny", 119, "black");
        System.out.println(user1.toString());
        // 3. Builder pattern
        User user2 = User.newBuilder().setName("LiMing").setFavoriteColor("yellow").setFavoriteNumber(120).build();
        // Serialize the objects to user.avro
        String path = "user.avro";
        DatumWriter<User> userDatumWriter = new SpecificDatumWriter<>(User.class);
        // try-with-resources closes the writer even if an append fails
        try (DataFileWriter<User> dataFileWriter = new DataFileWriter<>(userDatumWriter)) {
            dataFileWriter.create(user1.getSchema(), new File(path));
            dataFileWriter.append(user1);
            dataFileWriter.append(user2);
            dataFileWriter.append(user);
        }
        System.out.println();
        // Read the serialized file back
        DatumReader<User> reader = new SpecificDatumReader<>(User.class);
        try (DataFileReader<User> dataFileReader = new DataFileReader<>(new File(path), reader)) {
            while (dataFileReader.hasNext()) {
                User user4 = dataFileReader.next();
                System.out.println(user4);
            }
        }
    }
}
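Here is what the serialized file looks like. It is a binary format, so only the embedded schema and the string values are readable as text: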
Objavro.schema�{"type":"record","name":"User","namespace":"com.beng.rpc.avro","fields":[{"name":"name","type":"string"},{"name":"favorite_number","type":["int","null"]},{"name":"favorite_color","type":["string","null"]}]}�f���rs뿾�z`
Danny�
blackLiMing�yellow
Janny�red�f��
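The unreadable bytes between the names are not corruption. According to the Avro specification, an int or long is written as a variable-length zig-zag integer, a string as its length followed by its UTF-8 bytes, and a union value as the index of the chosen branch followed by the value itself. The library-free sketch below re-encodes the record ("Danny", 119, "black") by hand to show where those bytes come from. The class and method names (AvroBytesSketch, writeLong, and so on) are made up for illustration; in real code Avro's own encoder does this work, and the file header and block framing are left out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class AvroBytesSketch {

    // Zig-zag maps small negative and positive numbers to small unsigned ones;
    // the result is then written 7 bits at a time, low-order group first.
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // high bit set: more bytes follow
            z >>>= 7;
        }
        out.write((int) z);
    }

    // A string is its byte length (as a zig-zag long) followed by UTF-8 bytes.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Field order follows user.avsc: name, favorite_number, favorite_color.
    static byte[] encodeUser(String name, int favoriteNumber, String favoriteColor) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, name);
        writeLong(out, 0);               // union ["int", "null"]: branch 0 = int
        writeLong(out, favoriteNumber);
        writeLong(out, 0);               // union ["string", "null"]: branch 0 = string
        writeString(out, favoriteColor);
        return out.toByteArray();
    }

    static String toHex(byte[] bytes) {
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // prints 0a 44 61 6e 6e 79 00 ee 01 00 0a 62 6c 61 63 6b
        System.out.println(toHex(encodeUser("Danny", 119, "black")));
    }
}
```

The length prefix of a five-character string is the byte 0a, which is a line feed; that is why "Danny" and "black" each start on a new line in the dump above.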
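The above uses Java code generated from the schema.
Without code generation
There is no need to generate Java code from the schema, but the developer has to supply the schema at runtime.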
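import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
// GenericMain is only a wrapper class name chosen for this example
public class GenericMain {
    public static void main(String[] args) throws IOException {
        // user.avsc is placed in the "resources/avro" directory
        Schema schema;
        try (InputStream inputStream = ClassLoader.getSystemResourceAsStream("avro/user.avsc")) {
            schema = new Schema.Parser().parse(inputStream);
        }
        // Field names must be the ones declared in user.avsc
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Zhang San");
        user.put("favorite_number", 30);
        user.put("favorite_color", "red");
        File diskFile = new File("/data/users.avro");
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<>(schema);
        try (DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<>(datumWriter)) {
            dataFileWriter.create(schema, diskFile);
            dataFileWriter.append(user);
        }
        DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(schema);
        try (DataFileReader<GenericRecord> dataFileReader = new DataFileReader<>(diskFile, datumReader)) {
            GenericRecord current = null;
            while (dataFileReader.hasNext()) {
                // Passing the previous record lets the reader reuse the object
                current = dataFileReader.next(current);
                System.out.println(current);
            }
        }
    }
}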
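In this case no Java API is generated, so the developer needs to know the structure of the schema in advance. Creating a User is like building a JSON object: values are assigned through put operations. Deserialization likewise requires a schema.
The GenericRecord interface provides a method to get a value by field name: Object get(String fieldName). Note, however, that the internal implementation is not based on a map but on an array whose slots correspond by index to the fields declared in the schema. A put looks up the index for the field name and then assigns the value; get does the reverse. Schema compatibility is therefore handled essentially the same way as with code generation.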
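The name-to-index lookup described above can be modelled in a few lines. This is a simplified, library-free sketch and not Avro's actual GenericData.Record source; the class name ArrayBackedRecord, its constructor, and its error handling are hypothetical, and a plain list of field names stands in for the schema.

```java
import java.util.Arrays;
import java.util.List;

final class ArrayBackedRecord {
    private final List<String> fieldNames; // stands in for the schema's field list
    private final Object[] values;         // one slot per field, in schema order

    ArrayBackedRecord(String... fieldNames) {
        this.fieldNames = Arrays.asList(fieldNames);
        this.values = new Object[fieldNames.length];
    }

    // Translate a field name into its position in the schema.
    private int position(String fieldName) {
        int index = fieldNames.indexOf(fieldName);
        if (index < 0) {
            throw new IllegalArgumentException("Not a schema field: " + fieldName);
        }
        return index;
    }

    void put(String fieldName, Object value) {
        values[position(fieldName)] = value;
    }

    Object get(String fieldName) {
        return values[position(fieldName)];
    }

    // Positional access, the way generated classes use get(int) and put(int, Object).
    Object get(int index) {
        return values[index];
    }

    public static void main(String[] args) {
        ArrayBackedRecord user =
                new ArrayBackedRecord("name", "favorite_number", "favorite_color");
        user.put("favorite_number", 30);
        System.out.println(user.get(1));                 // prints 30
        System.out.println(user.get("favorite_color"));  // prints null (never set)
    }
}
```

Because values live in positional slots, the by-name and by-index views always agree, which mirrors the switch on the field index in the generated User.get(int) shown earlier.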
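Use case: building a big data platform with three members of the Apache family, Apache Storm + Avro + Kafka.
Kryo is another serialization tool; worth looking into when there is time.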