Apache Avro is a data serialization system: a high-performance piece of middleware built around binary data transfer.
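1. Features
Rich data structures
A compact, fast binary data format
A container file for storing persistent data
Remote procedure call (RPC)
Simple integration with dynamic languages: when Avro is used from a dynamic language, no code generation is needed to read or write data files or to use the RPC protocol. Code generation is an optional optimization that is only worth implementing for statically typed languages.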
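2. Comparison with other systems
Avro has implementations in several programming languages (C, C++, C#, Java, Python, Ruby, PHP). It provides functionality similar to systems such as Thrift and Protocol Buffers, but differs from them in a few fundamental respects:
Dynamic typing: Avro does not require code generation. Data is always stored together with its schema, so the data can be fully processed without generated code or static data types. This makes it easier to build generic data-processing systems and languages.
Untagged data: because the schema is known when data is read, far less type information has to be encoded with the data, so the serialized form is smaller.
No user-assigned field numbers: even when a schema changes, both the old and the new schema are known when the data is processed, so differences can be resolved by field name.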
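The idea of resolving differences by field name can be illustrated with a small standard-library sketch. This is a simplified illustration of the concept only, not Avro's actual schema resolution algorithm; the class `NameBasedResolution`, the method `resolve`, and the `email` field with its default are hypothetical names chosen for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NameBasedResolution {

    /**
     * Builds the record the way the reader expects it.
     *
     * @param written        field name -> value, as produced with the writer's (old) schema
     * @param readerDefaults field name -> default value, one entry per field in the
     *                       reader's (new) schema
     */
    public static Map<String, Object> resolve(Map<String, Object> written,
                                              Map<String, Object> readerDefaults) {
        Map<String, Object> result = new LinkedHashMap<String, Object>();
        for (Map.Entry<String, Object> field : readerDefaults.entrySet()) {
            String name = field.getKey();
            // Fields are matched by name: take the written value if the writer
            // had this field, otherwise fall back to the reader's default.
            result.put(name, written.containsKey(name) ? written.get(name) : field.getValue());
        }
        // Writer fields the reader does not know about are simply skipped.
        return result;
    }

    public static void main(String[] args) {
        // Data written with the old schema.
        Map<String, Object> written = new LinkedHashMap<String, Object>();
        written.put("name", "Ben");
        written.put("favorite_number", 7);
        written.put("favorite_color", "red");

        // New schema: favorite_number was dropped, email was added (hypothetical field).
        Map<String, Object> readerDefaults = new LinkedHashMap<String, Object>();
        readerDefaults.put("name", null);
        readerDefaults.put("favorite_color", null);
        readerDefaults.put("email", "unknown");

        System.out.println(resolve(written, readerDefaults));
    }
}
```

Because matching is done on names, no numeric field IDs have to be assigned or kept stable between schema versions.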
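3. Where Avro's value lies
Avro serializes data and is well suited to exchanging large volumes of data, whether remotely or locally.
Because Avro serializes data into a binary form for transfer, it saves both storage space and network bandwidth.
An analogy: a 100-square-meter house can originally hold 100 items, and you would like some technique that lets the same floor area hold 150 or more additional items. Data in a cache is the same: cache is precious, so its limited space should be used to hold as much data as possible. Likewise, network bandwidth is a limited resource, and we want the same bandwidth to carry a larger volume of data, especially when transferring and storing structured data. That is the point and the value of Avro.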
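To make the space saving concrete: the Avro specification states that `int` and `long` values are written using variable-length zig-zag coding, so small numbers take one or two bytes instead of a fixed four or eight. The following is a minimal standard-library sketch of that encoding for illustration; it is not the Avro library's own code, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

public class ZigZagVarint {

    /** Encodes a long with zig-zag mapping followed by a 7-bits-per-byte varint. */
    public static byte[] encodeLong(long n) {
        // Zig-zag: map signed values to unsigned so small magnitudes stay small
        // (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...).
        long z = (n << 1) ^ (n >> 63);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Emit 7 bits at a time, low bits first; the high bit marks "more bytes follow".
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    /** Decodes a value produced by encodeLong. */
    public static long decodeLong(byte[] bytes) {
        long z = 0;
        int shift = 0;
        for (byte b : bytes) {
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break; // last byte of this value
            }
        }
        // Undo the zig-zag mapping.
        return (z >>> 1) ^ -(z & 1);
    }

    public static void main(String[] args) {
        long[] samples = {7, 256, -1, 1000000};
        for (long v : samples) {
            System.out.println(v + " -> " + encodeLong(v).length + " byte(s)");
        }
    }
}
```

With this scheme the value 7 fits in a single byte and 256 in two, compared with eight bytes for a fixed-width `long`.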
4. Getting Started (Java)
1) Create a new Maven project
pom.xml
<dependencies>
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.7.7</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.avro</groupId>
<artifactId>avro-maven-plugin</artifactId>
<version>1.7.7</version>
<executions>
<execution>
<phase>generate-sources</phase>
<goals>
<goal>schema</goal>
</goals>
<configuration>
<sourceDirectory>${project.basedir}/src/main/resources/</sourceDirectory>
<outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>1.7</source>
<target>1.7</target>
</configuration>
</plugin>
</plugins>
</build>
2) Define the schema
The user.avsc file
{"namespace": "org.pq.avro",
"type": "record",
"name": "User",
"fields": [
{"name": "name", "type": "string"},
{"name": "favorite_number", "type": ["int", "null"]},
{"name": "favorite_color", "type": ["string", "null"]}
]
}
3) Serializing and deserializing with code generation
Run the following in the Maven project directory: $ mvn clean compile
This generates the User.java class under the org.pq.avro directory (note the namespace in user.avsc).
Then write a test class, Test.java
package org.pq.avro;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;
import java.io.File;
import java.io.IOException;
public class Test {
public static void main(String[] args) throws IOException {
//1.creating Users
User u1 = new User();
u1.setName("Alyssa");
u1.setFavoriteNumber(256);
User u2 = new User("Ben",7,"red");
User u3 = User.newBuilder()
.setName("Charlie")
.setFavoriteColor("blue")
.setFavoriteNumber(null)
.build();
//2.Now let's serialize our Users to disk
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
File file = new File("users.avro");
dataFileWriter.create(u1.getSchema(),file);
dataFileWriter.append(u1);
dataFileWriter.append(u2);
dataFileWriter.append(u3);
dataFileWriter.close();
//3.Deserialize Users from disk
DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class);
DataFileReader<User> dataFileReader = new DataFileReader<User>(file, userDatumReader);
User user = null;
while (dataFileReader.hasNext()) {
// Reuse user object by passing it to next(). This saves us from
// allocating and garbage collecting many objects for files with
// many items.
user = dataFileReader.next(user);
System.out.println(user);
}
// Release the underlying file handle once all records are read
dataFileReader.close();
}
}
Run Result:
{"name": "Alyssa", "favorite_number": 256, "favorite_color": null}
{"name": "Ben", "favorite_number": 7, "favorite_color": "red"}
{"name": "Charlie", "favorite_number": null, "favorite_color": "blue"}
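The users.avro file written by this test is an Avro object container file. According to the Avro specification, such a file starts with the four bytes 'O', 'b', 'j', 1. As a quick sanity check that needs only the standard library, the header can be inspected as sketched here; the class `AvroMagicCheck` and its method names are hypothetical, and the check looks at the magic bytes only, not at the rest of the file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class AvroMagicCheck {

    // Magic header of an Avro object container file: 'O', 'b', 'j', 1.
    private static final byte[] MAGIC = {'O', 'b', 'j', 1};

    /** Returns true if the given bytes begin with the Avro container magic. */
    public static boolean hasAvroMagic(byte[] data) {
        if (data == null || data.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads only the first four bytes of the file and checks them. */
    public static boolean looksLikeAvroContainer(Path path) throws IOException {
        byte[] header = new byte[MAGIC.length];
        try (InputStream in = Files.newInputStream(path)) {
            int read = 0;
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    return false; // file is shorter than the magic header
                }
                read += n;
            }
        }
        return hasAvroMagic(header);
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the file name used in this article.
        Path path = Paths.get(args.length > 0 ? args[0] : "users.avro");
        if (!Files.exists(path)) {
            System.out.println(path + " does not exist");
            return;
        }
        System.out.println(path + " has Avro magic: " + looksLikeAvroContainer(path));
    }
}
```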
4) Serializing and deserializing without code generation
package org.pq.avro;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import java.io.File;
import java.io.IOException;
import java.net.URISyntaxException;
public class Test2 {
public static void main(String[] args) throws IOException, URISyntaxException {
//First, we use a Parser to read our schema definition and create a Schema object.
File file = new File(Test2.class.getClassLoader().getResource("user.avsc").toURI());
Schema schema = new Schema.Parser().parse(file);
//using this schema,let's create some users
GenericRecord u1 = new GenericData.Record(schema);
u1.put("name","Alyssa");
u1.put("favorite_number",256);
GenericRecord u2 = new GenericData.Record(schema);
u2.put("name","Ben");
u2.put("favorite_number",7);
u2.put("favorite_color","red");
// Serialize u1 and u2 to disk
File usersFile = new File("users.avro");
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
// Write to usersFile; "file" is the schema file (user.avsc) and must not be overwritten
dataFileWriter.create(schema, usersFile);
dataFileWriter.append(u1);
dataFileWriter.append(u2);
dataFileWriter.close();
// Deserialize users from disk
DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(usersFile, datumReader);
GenericRecord user = null;
while (dataFileReader.hasNext()) {
// Reuse user object by passing it to next(). This saves us from
// allocating and garbage collecting many objects for files with
// many items.
user = dataFileReader.next(user);
System.out.println(user);
}
// Release the underlying file handle once all records are read
dataFileReader.close();
}
}
Run result:
{"name": "Alyssa", "favorite_number": 256, "favorite_color": null}
{"name": "Ben", "favorite_number": 7, "favorite_color": "red"}
References:
https://avro.apache.org/docs/current/gettingstartedjava.html
http://www.javabloger.com/article/hadoop-avro-rpc-java.html