Avro Data Structures

打磨时光

[Reposted] https://shift-alt-ctrl.iteye.com/blog/2217425
Avro schemas are written in JSON, so they are easy to author once you know the Avro specification. This post briefly introduces Avro's data structures.

Primitive types: null, boolean, int, long, float, double, bytes, string

Complex types (six in total): record, enum, array, map, union, fixed

1. record

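A record is the equivalent of a "class" in Java. It supports the following attributes:

1) name: required. The name of the record; used as the class name in generated Java code.
2) namespace: qualifies the name; used as the package name in generated Java code. namespace + name together form the record's full name.
3) doc: optional. Documentation or remarks.
4) aliases: alternate names.
5) fields: the list of fields, strictly ordered. Each field has the attributes "name", "doc", "type", "default", "order" and "aliases".

Every field in the fields list must have a unique name; type gives the field's data type.
default is the field's default value: when the data being read does not contain the field, the reader uses the default instead. order is optional and sets the sort order (ascending, descending or ignore), which is useful in MapReduce integration.
aliases is a JSON array listing the field's alternate names.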

{  
  "type": "record",   
  "name": "User",  
  "namespace":"com.test.avro",  
  "aliases": ["User1","User2"],                        
  "fields" : [  
    {"name": "age", "type": "int","default":10},               
    {"name": "email", "type": ["null", "string"]}  
  ]  
}

2. enum

{ "type": "enum",  
  "name": "Suit",  
  "symbols" : ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]  
}

The "symbols" attribute lists the enum's constants. Symbols must not repeat.

3. array

An array, much like an array in Java. Its "items" attribute gives the type of the array's elements.

{"type":"array","items":"string"}

4. map

Like a Map in Java, except that keys are always strings (the key type cannot be declared). The "values" attribute declares the value type.

{"type":"map","values":"long"}

5. union

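A union is a set in the mathematical sense: its member types must not repeat. If a field's "type" is a union, the field's default value must match the first type in the union.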

{"type":["int","null"],"default":10}

6. fixed

A fixed value has a fixed length; the "size" attribute gives that length in bytes.

{"type":"fixed","size":16,"name":"md5"}

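Records, enums, fixed types and fields can all declare aliases (former names), which matter a great deal for schema compatibility. A reader can use aliases to map the writer's schema onto its own, handling differently named data sets as a form of schema evolution. For example, if the writer's schema has a field named "Foo" and the reader's schema has a field named "Bar" with the alias "Foo", the reader maps the "Foo" data to Bar and processes it normally. If names in a schema collide, a "namespace" can be used to build a fully namespace-qualified name.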
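The alias lookup described above can be sketched in plain Java. This is only an illustration of the matching rule, not Avro's actual resolver: the class name `AliasResolutionSketch` and the map-based representation of reader fields are assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of how a reader field finds its writer field by name or alias. */
final class AliasResolutionSketch {

    /**
     * readerFieldAliases maps each reader field name to its declared aliases.
     * Returns, per reader field, the writer field it reads from, or null when
     * neither the name nor any alias exists in the writer's schema.
     */
    static Map<String, String> resolve(Map<String, List<String>> readerFieldAliases,
                                       Set<String> writerFields) {
        Map<String, String> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> field : readerFieldAliases.entrySet()) {
            String match = null;
            if (writerFields.contains(field.getKey())) {
                match = field.getKey();            // same name on both sides
            } else {
                for (String alias : field.getValue()) {
                    if (writerFields.contains(alias)) {
                        match = alias;             // writer still uses the former name
                        break;
                    }
                }
            }
            resolved.put(field.getKey(), match);
        }
        return resolved;
    }
}
```

With a reader field "Bar" aliased to "Foo" and a writer that only has "Foo", `resolve` maps "Bar" to "Foo". A field that maps to null is the case where a real reader would fall back to the field's default value.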

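Avro Encoding

Avro implements two encodings: BinaryEncoder and JsonEncoder. Data storage and RPC normally use BinaryEncoder, which yields smaller data and faster processing. For debugging or web-based applications, JsonEncoder is usually more convenient (the data is formatted as JSON, via Jackson).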

Binary Encoding

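For primitive types the binary encoding rules are: a "null" value is written as zero bytes; a "boolean" is written as a single byte, 0 or 1; "int" and "long" use variable-length zig-zag ("varint") coding; "float" takes 4 bytes and "double" 8 bytes; "bytes" is encoded as a long giving the length, followed by the bytes themselves; "string" is encoded like bytes, with the text encoded as UTF-8. Notably, Avro writes neither field indexes nor field types to the stream, which sets it apart from protobuf.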
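To make the int/long and string rules concrete, here is a minimal hand-rolled sketch in plain Java. It does not use the Avro library; the class name `AvroPrimitiveSketch` and its helpers are made up for this illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hand-rolled sketch of Avro's binary encoding for long and string values. */
final class AvroPrimitiveSketch {

    /** Zig-zag maps signed values to unsigned ones so small magnitudes stay short. */
    static byte[] encodeLong(long value) {
        long n = (value << 1) ^ (value >> 63);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Emit 7 bits at a time, low-order group first; a set high bit means "more bytes follow".
        while ((n & ~0x7FL) != 0) {
            out.write((int) ((n & 0x7F) | 0x80));
            n >>>= 7;
        }
        out.write((int) n);
        return out.toByteArray();
    }

    /** A string is its UTF-8 byte length (as a long) followed by the UTF-8 bytes. */
    static byte[] encodeString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] length = encodeLong(utf8.length);
        out.write(length, 0, length.length);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    /** Formats bytes as space-separated lowercase hex, for inspection only. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```

Under this scheme `hex(encodeLong(0))` is "00", `-1` becomes "01", `1` becomes "02", `64` becomes "80 01", and `hex(encodeString("a"))` is "02 61".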
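A record's structure is not encoded; only its fields are, in declaration order, so an encoded record is simply the concatenation of its encoded fields. Parsing is driven by the schema rather than by structural information embedded in the output. An enum has a list of symbols, so encoding only needs the index of the value within that list: for an enum with the symbols ["A","B","C"], the value A is encoded as 0. This is also why developers must not casually change an enum's symbol list.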
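As a rough sketch of these two rules, the following plain-Java code encodes the User record and a Suit-style enum from the schemas shown earlier. It is hand-written for illustration and does not use the Avro library; the class and method names are assumptions. Because the email field is a union, its value is preceded by the branch index, as the union rule in this section describes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Sketch: a record is just its fields written back to back; an enum is a symbol index. */
final class AvroRecordSketch {

    static void writeLong(ByteArrayOutputStream out, long value) {
        long n = (value << 1) ^ (value >> 63);     // zig-zag
        while ((n & ~0x7FL) != 0) {
            out.write((int) ((n & 0x7F) | 0x80));
            n >>>= 7;
        }
        out.write((int) n);
    }

    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    /** Encodes User: age (int), then email (["null","string"]), with no field names or tags. */
    static byte[] encodeUser(int age, String emailOrNull) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(out, age);                       // int and long share the zig-zag varint form
        if (emailOrNull == null) {
            writeLong(out, 0);                     // union branch 0 = null, which has no body
        } else {
            writeLong(out, 1);                     // union branch 1 = string
            writeString(out, emailOrNull);
        }
        return out.toByteArray();
    }

    /** An enum value is written as the zero-based position of its symbol. */
    static byte[] encodeEnum(List<String> symbols, String symbol) {
        int index = symbols.indexOf(symbol);
        if (index < 0) throw new IllegalArgumentException("unknown symbol: " + symbol);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(out, index);
        return out.toByteArray();
    }
}
```

For example, `encodeUser(10, "a")` yields the bytes 14 02 02 61 (hex), and HEARTS in the Suit enum encodes as 02 because its index is 1. Reordering the symbol list would silently change what previously written data means.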
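Array encoding is slightly more involved. An array may hold many items, which are encoded as a series of blocks, each containing some number of items. How many items go into a block is decided by the writer's buffer: a block is written whenever the buffer fills up. A reader therefore cannot know in advance how many items an array holds. Each block consists of a long count followed by that many items; a block whose count is 0 marks the end of the array. The item type determines how each item is encoded (see the primitive rules above).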
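The block layout can be sketched as follows for an array of strings. This is a simplified illustration in plain Java, not Avro's encoder: the `itemsPerBlock` parameter is a hypothetical stand-in for "the writer's buffer filled up", and only positive block counts are handled.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Sketch of Avro's blocked array encoding for {"type":"array","items":"string"}. */
final class AvroArraySketch {

    static void writeLong(ByteArrayOutputStream out, long value) {
        long n = (value << 1) ^ (value >> 63);     // zig-zag
        while ((n & ~0x7FL) != 0) {
            out.write((int) ((n & 0x7F) | 0x80));
            n >>>= 7;
        }
        out.write((int) n);
    }

    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    /** Writes the items in blocks of at most itemsPerBlock entries, then a zero count. */
    static byte[] encodeStringArray(List<String> items, int itemsPerBlock) {
        if (itemsPerBlock <= 0) throw new IllegalArgumentException("itemsPerBlock must be positive");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int start = 0; start < items.size(); start += itemsPerBlock) {
            int end = Math.min(start + itemsPerBlock, items.size());
            writeLong(out, end - start);           // item count of this block
            for (String item : items.subList(start, end)) {
                writeString(out, item);
            }
        }
        writeLong(out, 0);                         // a zero count terminates the array
        return out.toByteArray();
    }
}
```

Encoding ["a","b","c"] with two items per block gives 04 02 61 02 62, then 02 02 63, then the terminating 00. An empty array is just the single byte 00.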
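For a union, the schema declares several types, for example {"type":["null","string"]}. The encoder first writes the zero-based position of the value's type within the union, then the value itself. A null value is therefore written as "00", and the string "a" as "02 02 61": the first 02 is the encoded index 1 (the position of string in the union), the second 02 is the encoded string length 1, and 61 is the UTF-8 byte for "a".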
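Seen from the reading side, the same rule looks like this. The sketch below is plain Java written for illustration (the class `AvroUnionSketch` is not part of Avro) and handles only the ["null","string"] union from the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch of decoding a value written with the union schema ["null","string"]. */
final class AvroUnionSketch {

    /** Reads one zig-zag varint long from the buffer. */
    static long readLong(ByteBuffer in) {
        long n = 0;
        int shift = 0;
        int b;
        do {
            b = in.get() & 0xFF;
            n |= (long) (b & 0x7F) << shift;       // low-order 7-bit groups come first
            shift += 7;
        } while ((b & 0x80) != 0);
        return (n >>> 1) ^ -(n & 1);               // undo zig-zag
    }

    /** The branch index comes first and tells the reader which type's rules to apply next. */
    static String readNullableString(ByteBuffer in) {
        long branch = readLong(in);
        if (branch == 0) {
            return null;                           // null has no body
        }
        if (branch == 1) {
            int length = (int) readLong(in);
            byte[] utf8 = new byte[length];
            in.get(utf8);
            return new String(utf8, StandardCharsets.UTF_8);
        }
        throw new IllegalStateException("branch index out of range: " + branch);
    }
}
```

Feeding it the bytes 02 02 61 returns "a", while the single byte 00 returns null. Nothing in the bytes names the types involved, which is why the reader needs the schema.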
fixed: this one is simple. A fixed type carries no structure beyond its byte length, so the bytes are written out directly.

JSON Encoding

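Data is written as a JSON string of key-value pairs. Note that null is written as the JSON literal null, as JSON dictates.
Avro data files also follow a strict format: a header followed by one or more file data blocks. The header holds three parts, "magic", "meta" and "sync". magic is four bytes: the ASCII characters 'O', 'b', 'j' followed by the byte 1 (see the source code). meta holds the file metadata, including the schema. sync is the sync marker, currently 16 randomly generated bytes.
A file data block contains: 1) the number of Avro objects in the block (object count), a long; 2) the length in bytes of the block's serialized data (block size), a long; 3) the serialized objects, compressed if a codec is specified; 4) the 16-byte sync marker.
Organizing data into blocks makes it possible to extract or skip a block very efficiently (located via sync) without deserializing its contents. Together, the block size, object count and sync marker also help detect corrupt blocks and so protect data integrity. (See DataFileReader, nextBlock and related APIs.)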
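The block layout and the "skip without deserializing" idea can be sketched like this. It is a simplified illustration in plain Java, not the DataFileReader implementation: the class `AvroFileBlockSketch` and its methods are assumptions, and codecs and the header's meta map are left out.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.Arrays;

/** Sketch of an Avro data file block: count, size, payload, sync marker. */
final class AvroFileBlockSketch {
    static final byte[] MAGIC = {'O', 'b', 'j', 1};
    static final int SYNC_SIZE = 16;

    static void writeLong(ByteArrayOutputStream out, long value) {
        long n = (value << 1) ^ (value >> 63);     // zig-zag
        while ((n & ~0x7FL) != 0) {
            out.write((int) ((n & 0x7F) | 0x80));
            n >>>= 7;
        }
        out.write((int) n);
    }

    static long readLong(ByteBuffer in) {
        long n = 0;
        int shift = 0;
        int b;
        do {
            b = in.get() & 0xFF;
            n |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (n >>> 1) ^ -(n & 1);
    }

    /** True when the given bytes start with the four magic bytes of a data file. */
    static boolean hasMagic(byte[] fileStart) {
        return fileStart.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(fileStart, MAGIC.length), MAGIC);
    }

    static byte[] writeBlock(long objectCount, byte[] serializedObjects, byte[] sync) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeLong(out, objectCount);
        writeLong(out, serializedObjects.length);
        out.write(serializedObjects, 0, serializedObjects.length);
        out.write(sync, 0, sync.length);
        return out.toByteArray();
    }

    /** Jumps over one block without decoding its objects and returns the object count. */
    static long skipBlock(ByteBuffer in, byte[] expectedSync) {
        long count = readLong(in);
        int size = (int) readLong(in);
        in.position(in.position() + size);         // the payload is never parsed
        byte[] sync = new byte[SYNC_SIZE];
        in.get(sync);
        if (!Arrays.equals(sync, expectedSync)) {
            throw new IllegalStateException("sync marker mismatch: block is corrupt");
        }
        return count;
    }
}
```

Because the block size is stored up front, `skipBlock` only reads two longs and the trailing marker. If the size is wrong or the payload was truncated, the bytes at the expected position will not match the file's sync marker, which is how corruption shows up.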
The supported codecs are "null" and "deflate", with "snappy" optional (it must be installed and specified manually).

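Protocol and RPC

A protocol is like an interface in Java: it defines RPC operations. A protocol has the following attributes:
1) protocol: required. The protocol's name; in Java this is the interface name.
2) namespace: optional, similar to a package name.
3) doc: remarks or comments.
4) types: defines the data types the protocol uses (record, enum, fixed and error). An error is defined exactly like a record and is semantically a record as well; it represents the protocol's exception information.
5) messages: similar to the list of methods in Java. Each message has the attributes doc, request, response and errors.

Here doc is again a comment. request is a JSON array declaring the message's parameter list, structured like a record's fields. response is the type of the returned data, with "null" meaning there is no return value. errors is optional and declares the list of possible exception types as a union.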

Code example: first add the avro-ipc dependency and adjust the plugin configuration (pom.xml):

<dependency>  
  <groupId>org.apache.avro</groupId>  
  <artifactId>avro-ipc</artifactId>  
  <version>1.7.7</version>  
</dependency>

<plugin>  
    <groupId>org.apache.avro</groupId>  
    <artifactId>avro-maven-plugin</artifactId>  
    <version>1.7.7</version>  
    <executions>  
        <execution>  
            <phase>generate-sources</phase>  
            <goals>  
                <goal>schema</goal>  
                <goal>protocol</goal>  
                <goal>idl-protocol</goal>  
            </goals>  
            <configuration>  
                <sourceDirectory>${project.basedir}/src/main/resources/avro/</sourceDirectory>  
                <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>  
            </configuration>  
        </execution>  
    </executions>  
</plugin>

helloworld.avpr: this file declares the protocol schema

{  
  "namespace": "com.test.avro.rpc",  
  "protocol": "HelloWorld",  
  "doc": "Protocol Greetings",  
  
  "types": [  
    {"name": "Greeting", "type": "record", "fields": [  
      {"name": "message", "type": "string"}]},  
    {"name": "Curse", "type": "error", "fields": [  
      {"name": "message", "type": "string"}]}  
  ],  
  
  "messages": {  
    "hello": {  
      "doc": "Say hello.",  
      "request": [{"name": "greeting", "type": "Greeting" }],  
      "response": "Greeting",  
      "errors": ["Curse"]  
    }  
  }  
}

Then, as with ordinary Avro code generation, run "mvn compile" to generate the Java code the protocol needs.

HelloWorldImpl.java
The protocol is declared as HelloWorld, so after code generation HelloWorld is an interface, and we implement it to supply the actual business logic.

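import org.apache.avro.AvroRemoteException;
import com.test.avro.rpc.Curse;
import com.test.avro.rpc.Greeting;
import com.test.avro.rpc.HelloWorld;

// Server-side implementation of the generated HelloWorld interface.
public class HelloWorldImpl implements HelloWorld {
    @Override
    public Greeting hello(Greeting greeting) throws AvroRemoteException, Curse {
        // Print the incoming message, then reuse the same record for the reply.
        System.out.println(greeting.getMessage());
        greeting.setMessage("From Server");
        return greeting;
    }
}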

Test code

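import java.net.InetSocketAddress;

// Package names below match avro-ipc 1.7.7, the version used in the pom above.
import org.apache.avro.ipc.NettyServer;
import org.apache.avro.ipc.NettyTransceiver;
import org.apache.avro.ipc.Server;
import org.apache.avro.ipc.specific.SpecificRequestor;
import org.apache.avro.ipc.specific.SpecificResponder;

import com.test.avro.rpc.Greeting;
import com.test.avro.rpc.HelloWorld;

// The class name HelloWorldTest is only a placeholder for this example.
public class HelloWorldTest {
    public static void main(String[] args) throws Exception {
        // Start the server on port 8080 and give it a moment to come up.
        Server server = new NettyServer(new SpecificResponder(HelloWorld.class, new HelloWorldImpl()), new InetSocketAddress(8080));
        server.start();
        Thread.sleep(3000);

        // Client side: attach to the server and send a message through the generated proxy.
        NettyTransceiver client = new NettyTransceiver(new InetSocketAddress(8080));
        HelloWorld proxy = (HelloWorld) SpecificRequestor.getClient(HelloWorld.class, client);
        Greeting request = new Greeting();
        request.setMessage("From client");
        Greeting response = proxy.hello(request);
        System.out.println(response.getMessage());

        client.close();
        server.close();
    }
}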

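The process is very simple. avro-ipc depends on the Netty jars, and using Netty for the underlying I/O is a sound choice. avro-ipc also offers other transports, such as HttpServer, SaslSocketServer and DatagramServer, so pick whichever suits your situation. Their internals are broadly similar, built on dynamic proxies plus reflection. Because RPC is interactive, production use usually also calls for connection pooling on both the client and the server side to raise throughput; avro-ipc does not provide this, so developers have to implement it themselves. Also be careful about using a NettyTransceiver in a multithreaded environment: queue the requests, or give each request a unique ID, so that messages do not get mixed up.
When HTTP is used, Avro exchanges messages as requests and responses. A protocol is normally addressed by a single URL, the HTTP Content-Type must be "avro/binary", and the client must send requests with POST.
"handshake": the main purpose of the handshake is to make sure the client and the server each hold the other's protocol declaration, so that the client can deserialize responses correctly and the server can deserialize requests correctly. Before doing any real work, the client and server first exchange (confirm) protocol schemas through the handshake. HTTP is stateless, so every request includes a handshake. On a stateful channel such as TCP, the handshake only needs to happen once after the connection is established, after which both sides cache the protocol schemas.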

{  
  "type": "record",  
  "name": "HandshakeRequest", "namespace":"org.apache.avro.ipc",  
  "fields": [  
    {"name": "clientHash",  
     "type": {"type": "fixed", "name": "MD5", "size": 16}},  
    {"name": "clientProtocol", "type": ["null", "string"]},  
    {"name": "serverHash", "type": "MD5"},  
    {"name": "meta", "type": ["null", {"type": "map", "values": "bytes"}]}  
  ]  
}  
{  
  "type": "record",  
  "name": "HandshakeResponse", "namespace": "org.apache.avro.ipc",  
  "fields": [  
    {"name": "match",  
     "type": {"type": "enum", "name": "HandshakeMatch",  
              "symbols": ["BOTH", "CLIENT", "NONE"]}},  
    {"name": "serverProtocol",  
     "type": ["null", "string"]},  
    {"name": "serverHash",  
     "type": ["null", {"type": "fixed", "name": "MD5", "size": 16}]},  
    {"name": "meta",  
     "type": ["null", {"type": "map", "values": "bytes"}]}  
  ]  
}

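The two schemas above are the handshake schemas, that is, the schemas the client and server use when exchanging protocol schemas: one for the client request, one for the server response.
With each request the client sends the hash of its own protocol (clientHash) together with the hash it believes the server's protocol has (serverHash). The server then responds in one of three ways. If the server recognizes clientHash and the serverHash is correct, it returns (match=BOTH, serverProtocol=null, serverHash=null). If the server recognizes clientHash but the serverHash is wrong, the client and server schemas differ, so the server sends back its own protocol as (match=CLIENT, serverProtocol!=null, serverHash!=null); the client must use the returned protocol to process this and later responses, and should cache it. If the server does not recognize clientHash at all, it returns match=NONE and the client must resend the request with its full protocol text in clientProtocol.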
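The server's side of this decision can be sketched in a few lines. This is a simplified model in plain Java rather than avro-ipc's Responder code: the class `HandshakeSketch`, the `decide` method and the set of known client hashes are all hypothetical, and hashes are carried as hex strings for readability where the real handshake uses the 16 raw MD5 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Set;

/** Sketch of how a server could pick the HandshakeMatch value for a request. */
final class HandshakeSketch {
    enum Match { BOTH, CLIENT, NONE }

    static String md5Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);    // every Java platform must provide MD5
        }
    }

    /**
     * knownClientHashes is the server's cache of client protocols it has already seen
     * (hypothetical; a real server would cache the parsed protocols too).
     */
    static Match decide(String clientHash, String serverHashFromClient,
                        String clientProtocolOrNull, String serverProtocol,
                        Set<String> knownClientHashes) {
        if (clientProtocolOrNull != null) {
            // The client sent its protocol in full, so the server can learn it now.
            knownClientHashes.add(md5Hex(clientProtocolOrNull));
        }
        if (!knownClientHashes.contains(clientHash)) {
            return Match.NONE;                     // client must resend with clientProtocol set
        }
        boolean clientKnowsServer = md5Hex(serverProtocol).equals(serverHashFromClient);
        return clientKnowsServer ? Match.BOTH : Match.CLIENT;
    }
}
```

A first request from an unknown client yields NONE; once the client resends with its protocol text it is known, and the result is CLIENT or BOTH depending on whether the serverHash it sent matches the server's protocol.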
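As described above, the client and server need to agree on each other's schemas, meaning the MD5 values match, and that computation has to be done with great care. How do we assert that two schemas are semantically identical? Suppose the client's schema has a doc attribute and the server's does not. The doc has no effect on parsing, so logically the two schemas are still "the same", yet their MD5 values differ because the text differs. This is where the "Parsing Canonical Form" comes in. (It is fairly involved; see the official documentation for details.)
At this point we have covered Avro's core features and how to build a simple application with it. In my view Avro still lags well behind Thrift at the RPC level; building RPC applications with Thrift is very simple and production-ready. I prefer to use Avro for data storage and parsing.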