Hive Official Manual: Avro Files

Availability

Earliest version that supports the AvroSerDe

The AvroSerde is available in Hive 0.9.1 and greater.

Overview: Working with Avro from Hive

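The AvroSerde allows users to read or write Avro data as Hive tables. Key points about the AvroSerde:

Infers the schema of the Hive table from the Avro schema. Starting in Hive 0.14, the Avro schema can also be inferred from the Hive table schema.
Reads all Avro files within a table against a specified schema, taking advantage of Avro's backward compatibility.
Supports arbitrarily nested schemas.
Translates all Avro data types into equivalent Hive types. Most types map exactly, but some Avro types do not exist in Hive and are converted automatically by the AvroSerde.
Understands compressed Avro files.
Transparently converts the Avro idiom for nullable types, Union[T, null], into just T, and returns null when appropriate.
Writes any Hive table to Avro files.
Has worked reliably against Avro schemas in ETL processes.
Starting in Hive 0.14, columns can be added to an Avro-backed Hive table using the ALTER TABLE statement.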
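As a minimal sketch of the last point, adding a column to an Avro-backed table on Hive 0.14 or later could look like the statement below. The table name `kst` matches the example table used later on this page; the column name `added_note` and its comment are hypothetical, chosen only for illustration.

```sql
-- Assumes Hive 0.14+ and an existing Avro-backed table named kst.
-- added_note is a hypothetical column name used only for this sketch.
ALTER TABLE kst ADD COLUMNS (added_note string COMMENT 'illustrative new column');
```

Rows written before the change have no value for the new column, so queries on it should expect NULL for older data.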

For general information about SerDes, see Hive SerDe in the Developer Guide. Also see SerDe for details about input and output processing.

Requirements

The AvroSerde has been built and tested against Hive 0.9.1 and later, and uses Avro 1.7.5 as of Hive 0.13 and 0.14.

Hive Versions	Avro Version
Hive 0.9.1	Avro 1.5.3
Hive 0.10, 0.11, and 0.12	Avro 1.7.1
Hive 0.13 and 0.14	Avro 1.7.5

Avro to Hive type conversion

While most Avro types convert directly to equivalent Hive types, there are some which do not exist in Hive and are converted to reasonable equivalents. Also, the AvroSerde special cases unions of null and another type, as described below:

Avro type	Becomes Hive type	Note
null	void
boolean	boolean
int	int
long	bigint
float	float
double	double
bytes	binary	Bytes is converted to Array[smallint] prior to Hive 0.12.0.
string	string
record	struct
map	map
list	array
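union	union	Unions of [T, null] transparently convert to nullable T; other types translate directly to Hive's unions of those types. However, unions were introduced in Hive 0.7 and cannot currently be used in WHERE/GROUP BY clauses; they are essentially look-at-only. Because the AvroSerde transparently converts [T, null] to nullable T, this limitation applies only to unions of multiple types, or unions that are not a single type and null.
enum	string	Hive has no concept of enums.
fixed	binary	Fixeds are converted to Array[smallint] prior to Hive 0.12.0.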
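To make the nullable-union and enum rows concrete, here is a small sketch that embeds an Avro schema directly in the DDL through the AvroSerde's `avro.schema.literal` table property. The table name `nullable_demo`, the namespace, and both field names are hypothetical and exist only for this illustration.

```sql
-- Hypothetical table and field names; only the SerDe and format class names
-- and the avro.schema.literal property come from the AvroSerde itself.
CREATE TABLE nullable_demo
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES ('avro.schema.literal'='{
    "namespace": "example.demo",
    "name": "nullable_demo",
    "type": "record",
    "fields": [
      {"name": "maybe_count", "type": ["null", "int"], "default": null},
      {"name": "suit", "type": {"type": "enum", "name": "Suit",
                                "symbols": ["SPADES", "HEARTS"]}}
    ]
  }');

-- Inspect the Hive column types the AvroSerde derived from the schema.
DESCRIBE nullable_demo;
```

Going by the conversion table above, `maybe_count` should surface as a plain nullable int rather than a uniontype, and `suit` as a string.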

Creating Avro-backed Hive tables

Avro-backed tables can be created in Hive using the AvroSerDe.

All Hive versions

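To create an Avro-backed table, specify the serde as org.apache.hadoop.hive.serde2.avro.AvroSerDe, the input format as org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat, and the output format as org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat. Also provide a location from which the AvroSerde will pull the most current schema for the table. For example: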

 
    CREATE TABLE kst
      PARTITIONED BY (ds string)
      ROW FORMAT SERDE
      'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
      STORED AS INPUTFORMAT
      'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
      OUTPUTFORMAT
      'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
      TBLPROPERTIES (
        'avro.schema.url'='http://schema_provider/kst.avsc');

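In this example, the source-of-truth reader schema is pulled from a webserver. Other options for providing the schema are described below.

Add the Avro files to the database (or create an external table) using standard Hive operations (http://wiki.apache.org/hadoop/Hive/LanguageManual/DML).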
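As a rough sketch of those standard operations, the statements below show two common ways to attach existing Avro files to a partition of the `kst` table. The HDFS paths and the partition values are hypothetical placeholders, not values from this page.

```sql
-- Option 1: move an existing Avro file into a partition of the table.
-- The path and the ds value are hypothetical placeholders.
LOAD DATA INPATH '/user/hive/staging/kst/part-00000.avro'
  INTO TABLE kst PARTITION (ds='2012-01-01');

-- Option 2: register a directory that already holds Avro files as a partition,
-- leaving the files where they are.
ALTER TABLE kst ADD PARTITION (ds='2012-01-02')
  LOCATION '/data/kst/ds=2012-01-02';
```

Either way, the files are read against the table's reader schema, so they need to be compatible with it.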

This table might produce a description like the following:

 
    hive> describe kst;
    OK
    string1         string                  from deserializer
    string2         string                  from deserializer
    int1            int                     from deserializer
    boolean1        boolean                 from deserializer
    long1           bigint                  from deserializer
    float1          float                   from deserializer
    double1         double                  from deserializer
    inner_record1   struct<int_in_inner_record1:int,string_in_inner_record1:string> from deserializer
    enum1           string                  from deserializer
    array1          array<string>           from deserializer
    map1            map<string,string>      from deserializer
    union1          uniontype<float,boolean,string> from deserializer
    fixed1          binary                  from deserializer
    null1           void                    from deserializer
    unionnullint    int                     from deserializer
    bytes1          binary                  from deserializer

At this point, the Avro-backed table can be worked with in Hive like any other table.
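For instance, a query over the `kst` table described above might look like the following sketch. The column names come from the table description; the partition value and the map key `some_key` are hypothetical.

```sql
-- Scalar, struct, array, and map columns are read as in any other Hive table.
SELECT string1,
       inner_record1.int_in_inner_record1,
       array1[0]        AS first_array_item,
       map1['some_key'] AS some_value      -- hypothetical map key
FROM kst
WHERE ds = '2012-01-01'                    -- hypothetical partition value
  AND unionnullint IS NOT NULL             -- [null, int] union exposed as a nullable int
LIMIT 10;
```

The sketch filters on `unionnullint` rather than `union1` because, per the type conversion table, true multi-type unions cannot be used in WHERE or GROUP BY clauses.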

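Hive 0.14 and later versions

Starting in Hive 0.14, Avro-backed tables can be created simply by using "STORED AS AVRO" in a DDL statement. The AvroSerDe takes care of creating the appropriate Avro schema from the Hive table schema, which makes Avro considerably easier to use in Hive.

For example: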

 
    CREATE TABLE kst (
        string1 string,
        string2 string,
        int1 int,
        boolean1 boolean,
        long1 bigint,
        float1 float,
        double1 double,