Using Avro in Hive

me_lawrence

http://www.iteblog.com/archives/1007

To read Avro-format data, we can create the Hive table with the following statement:

 
hive> CREATE EXTERNAL TABLE tweets
    > COMMENT "A table backed by Avro data with the
    >        Avro schema embedded in the CREATE TABLE statement"
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED AS
    > INPUTFORMAT  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    > LOCATION '/user/wyp/examples/input/'
    > TBLPROPERTIES (
    >    'avro.schema.literal'='{
    >        "type": "record",
    >        "name": "Tweet",
    >        "namespace": "com.miguno.avro",
    >        "fields": [
    >            { "name":"username",  "type":"string"},
    >            { "name":"tweet",     "type":"string"},
    >            { "name":"timestamp", "type":"long"}
    >        ]
    >   }'
    > );
OK
Time taken: 0.076 seconds

hive> describe tweets;
OK
username                string                  from deserializer
tweet                   string                  from deserializer
timestamp               bigint                  from deserializer
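The describe output shows the Avro `long` field surfacing as a Hive `bigint` column. As a rough illustration of that mapping for primitive types only, here is a minimal Python sketch. The names `AVRO_TO_HIVE`, `TWEET_SCHEMA` and `hive_columns` are made up for this example, and complex Avro types (records, arrays, maps, unions) are deliberately left out.

```python
# Primitive Avro types and the Hive column types they show up as.
# Complex types (record, array, map, union, enum, fixed) are not covered here.
AVRO_TO_HIVE = {
    "boolean": "boolean",
    "int": "int",
    "long": "bigint",
    "float": "float",
    "double": "double",
    "bytes": "binary",
    "string": "string",
}

# The same schema that is embedded in avro.schema.literal.
TWEET_SCHEMA = {
    "type": "record",
    "name": "Tweet",
    "namespace": "com.miguno.avro",
    "fields": [
        {"name": "username", "type": "string"},
        {"name": "tweet", "type": "string"},
        {"name": "timestamp", "type": "long"},
    ],
}


def hive_columns(schema):
    """Return (column name, Hive type) pairs for a flat Avro record schema."""
    return [(field["name"], AVRO_TO_HIVE[field["type"]]) for field in schema["fields"]]


for name, hive_type in hive_columns(TWEET_SCHEMA):
    print(f"{name:<24}{hive_type}")
```

A schema that uses a type missing from the dictionary raises `KeyError`, which is a reasonable signal in a sketch that the mapping needs to be extended.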

Next, we compress the data we need with Snappy. Here is the data before compression:

 
{
   "username": "miguno",
   "tweet": "Rock: Nerf paper, scissors is fine.",
   "timestamp": 1366150681
},
{
   "username": "BlizzardCS",
   "tweet": "Works as intended.  Terran is IMBA.",
   "timestamp": 1366154481
},
{
   "username": "DarkTemplar",
   "tweet": "From the shadows I come!",
   "timestamp": 1366154681
},
{
   "username": "VoidRay",
   "tweet": "Prismatic core online!",
   "timestamp": 1366160000
}
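The article does not show how these records are turned into a Snappy-compressed Avro file. Before handing them to whatever encoder is used (the `fromjson` command of avro-tools with its `--codec` option is one possibility, assumed here rather than taken from the article), it can help to check that every record fits the Tweet schema and to write one JSON object per line. The following Python sketch does only that pre-check. It does not produce Avro or Snappy output, and `FIELD_TYPES`, `validate` and `to_json_lines` are hypothetical names.

```python
import json

# Python types expected for the Tweet fields: string, string, long.
FIELD_TYPES = {"username": str, "tweet": str, "timestamp": int}

# An Avro long is a 64-bit signed integer.
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1

RECORDS = [
    {"username": "miguno", "tweet": "Rock: Nerf paper, scissors is fine.", "timestamp": 1366150681},
    {"username": "BlizzardCS", "tweet": "Works as intended.  Terran is IMBA.", "timestamp": 1366154481},
    {"username": "DarkTemplar", "tweet": "From the shadows I come!", "timestamp": 1366154681},
    {"username": "VoidRay", "tweet": "Prismatic core online!", "timestamp": 1366160000},
]


def validate(record):
    """Raise ValueError if the record does not fit the Tweet schema."""
    if set(record) != set(FIELD_TYPES):
        raise ValueError(f"unexpected field set: {sorted(record)}")
    for name, py_type in FIELD_TYPES.items():
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, py_type):
            raise ValueError(f"{name} should be {py_type.__name__}: {value!r}")
    if not LONG_MIN <= record["timestamp"] <= LONG_MAX:
        raise ValueError("timestamp does not fit in an Avro long")


def to_json_lines(records):
    """Validate all records, then serialize them as one JSON object per line."""
    for record in records:
        validate(record)
    return "\n".join(json.dumps(record) for record in records) + "\n"


print(to_json_lines(RECORDS), end="")
```

Validating first means a malformed record is reported with its field name, instead of surfacing later as an encoder error.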

Suppose the compressed data is stored in the file /home/wyp/twitter.avro. We copy this file to the /user/wyp/examples/input/ directory in HDFS:

`hadoop fs -put /home/wyp/twitter.avro /user/wyp/examples/input/`

The data can then be queried from Hive:

 
hive> select * from tweets limit 5;
OK
miguno  Rock: Nerf paper, scissors is fine. 1366150681
BlizzardCS  Works as intended.  Terran is IMBA. 1366154481
DarkTemplar From the shadows I come!    1366154681
VoidRay Prismatic core online!  1366160000
Time taken: 0.495 seconds, Fetched: 4 row(s)

Alternatively, the schema given in avro.schema.literal, namely

{
   "type": "record",
   "name": "Tweet",
   "namespace": "com.miguno.avro",
   "fields": [
      {
         "name": "username",
         "type": "string"
      },
      {
         "name": "tweet",
         "type": "string"
      },
      {
         "name": "timestamp",
         "type": "long"
      }
   ]
}

can be stored in a separate file such as twitter.avsc. The CREATE TABLE statement above then becomes:

 
CREATE EXTERNAL TABLE tweets
    COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS
    INPUTFORMAT  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION '/user/wyp/examples/input/'
    TBLPROPERTIES (
        'avro.schema.url'='hdfs:///user/wyp/examples/schema/twitter.avsc'
    );
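This statement only works if twitter.avsc actually exists at the location named in avro.schema.url. One way to produce the file is sketched below in Python: it writes the schema to a local twitter.avsc and reads it back to confirm that it is valid JSON. The temporary directory and the helper names `write_schema` and `read_schema` are assumptions for the example. Copying the file to hdfs:///user/wyp/examples/schema/ (for instance with `hadoop fs -put`) is a separate step not performed here.

```python
import json
import os
import tempfile

# The schema previously embedded in avro.schema.literal.
TWEET_SCHEMA = {
    "type": "record",
    "name": "Tweet",
    "namespace": "com.miguno.avro",
    "fields": [
        {"name": "username", "type": "string"},
        {"name": "tweet", "type": "string"},
        {"name": "timestamp", "type": "long"},
    ],
}


def write_schema(path, schema):
    """Write an Avro schema (a plain JSON document) to the given local path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(schema, handle, indent=3)
        handle.write("\n")


def read_schema(path):
    """Load the schema back, failing loudly if the file is not valid JSON."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# A throwaway local directory; the real file must end up in HDFS afterwards.
schema_path = os.path.join(tempfile.mkdtemp(), "twitter.avsc")
write_schema(schema_path, TWEET_SCHEMA)
loaded = read_schema(schema_path)
print([field["name"] for field in loaded["fields"]])
```

Keeping the schema in one file means the same definition can be shared by the Hive table and by whatever job writes the Avro data.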