原文地址:https://www.bbsmax.com/A/QW5YY36N5m/
hive中建立外部分区表,外部数据格式是json的如何导入呢?
json格式的数据表不必含有分区字段,只需要在hdfs目录结构中体现出分区就可以了
This is all according to this guide: http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
- hive> ADD JAR /home/hadoop/hive-serdes-1.0-SNAPSHOT.jar;
- Added /home/hadoop/hive-serdes-1.0-SNAPSHOT.jar to class path
- Added resource: /home/hadoop/hive-serdes-1.0-SNAPSHOT.jar
In /tmp/new
I have a file abc.json
The CREATE EXTERNAL TABLE command runs properly, but it doesn't take in any data:
- hive>
- > CREATE EXTERNAL TABLE tweets (
- > id BIGINT,
- > created_at STRING,
- > source STRING,
- > favorited BOOLEAN,
- > retweeted_status STRUCT<
- > text:STRING,
- > user:STRUCT<screen_name:STRING,name:STRING>,
- > retweet_count:INT>,
- > entities STRUCT<
- > urls:ARRAY<STRUCT<expanded_url:STRING>>,
- > user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
- > hashtags:ARRAY<STRUCT<text:STRING>>>,
- > text STRING,
- > user STRUCT<
- > screen_name:STRING,
- > name:STRING,
- > friends_count:INT,
- > followers_count:INT,
- > statuses_count:INT,
- > verified:BOOLEAN,
- > utc_offset:INT,
- > time_zone:STRING>,
- > in_reply_to_screen_name STRING
- > )
- > PARTITIONED BY (datehour INT)
- > ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
- > LOCATION '/tmp/new';
- OK
- Time taken: 0.142 seconds
注意外部分区表需要手工添加分区
具体步骤
- .) Run the create table statement.
- .) In the directory /tmp/new/ create a sub directory datehour=<some int value>, and then put your .json file inside this.这里就是说只需要在hdfs上建立目录,目录体现分区信息,将数据放到对应目录,然后直接add partiton就好了
- .) Run alter table statement adding this partition to metadata:
- alter table tweets add partition(datehour=<some int value>);
- .) Now run the select statement.