How do you load JSON-formatted data into an external partitioned table in Hive?
The JSON records themselves do not need to contain the partition column; the partition only has to be reflected in the HDFS directory structure.
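That layout can be sketched as follows. This uses a local temporary directory to stand in for HDFS (on a real cluster you would use `hdfs dfs -mkdir -p` and `hdfs dfs -put` instead), and `2014010100` is just a made-up partition value:

```shell
# Mimic the HDFS layout locally; on a cluster, use `hdfs dfs -mkdir -p` / `hdfs dfs -put`.
# The table LOCATION is the base directory; each partition is a
# subdirectory named datehour=<value>, and the JSON file goes inside it.
BASE=$(mktemp -d)                       # stand-in for hdfs:///tmp/new
mkdir -p "$BASE/datehour=2014010100"
echo '{"id":1,"text":"hello"}' > "$BASE/datehour=2014010100/abc.json"
find "$BASE" -name '*.json'
```

Note that the partition value appears only in the directory name, never inside the JSON file itself.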
This is all according to this guide: http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
hive> ADD JAR /home/hadoop/hive-serdes-1.0-SNAPSHOT.jar;
Added /home/hadoop/hive-serdes-1.0-SNAPSHOT.jar to class path
Added resource: /home/hadoop/hive-serdes-1.0-SNAPSHOT.jar
In /tmp/new I have a file abc.json.
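A single record in abc.json might look like the line below. The field values are made up for illustration and cover only part of the table schema; the key point is that there is no `datehour` field, since the partition value comes from the directory name alone:

```json
{"id": 1, "created_at": "Mon Sep 30 07:08:51 +0000 2013", "source": "web", "favorited": false, "text": "hello hive", "user": {"screen_name": "alice", "name": "Alice", "friends_count": 10, "followers_count": 5, "statuses_count": 42, "verified": false, "utc_offset": 0, "time_zone": "UTC"}, "in_reply_to_screen_name": null}
```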
The CREATE EXTERNAL TABLE command runs properly, but it doesn't take in any data:
hive> CREATE EXTERNAL TABLE tweets (
    >   id BIGINT,
    >   created_at STRING,
    >   source STRING,
    >   favorited BOOLEAN,
    >   retweeted_status STRUCT<
    >     text:STRING,
    >     user:STRUCT<screen_name:STRING,name:STRING>,
    >     retweet_count:INT>,
    >   entities STRUCT<
    >     urls:ARRAY<STRUCT<expanded_url:STRING>>,
    >     user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
    >     hashtags:ARRAY<STRUCT<text:STRING>>>,
    >   text STRING,
    >   user STRUCT<
    >     screen_name:STRING,
    >     name:STRING,
    >     friends_count:INT,
    >     followers_count:INT,
    >     statuses_count:INT,
    >     verified:BOOLEAN,
    >     utc_offset:INT,
    >     time_zone:STRING>,
    >   in_reply_to_screen_name STRING
    > )
    > PARTITIONED BY (datehour INT)
    > ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
    > LOCATION '/tmp/new';
OK
Time taken: 0.142 seconds
Note that for an external partitioned table, the partitions must be added manually.
The steps are as follows:
1.) Run the create table statement.
2.) In the directory /tmp/new/ create a subdirectory datehour=<some int value>, and then put your .json file inside this. (In other words, you only need to create a directory on HDFS whose name carries the partition information, place the data in the corresponding directory, and then add the partition.)
3.) Run the alter table statement adding this partition to the metadata: alter table tweets add partition(datehour=<some int value>);
4.) Now run the select statement.
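Putting steps 3 and 4 together, a session might look like this (the partition value 2014010100 is just an example, matching a directory /tmp/new/datehour=2014010100/):

```sql
-- Register the partition directory with the metastore
ALTER TABLE tweets ADD PARTITION (datehour=2014010100);

-- The data in /tmp/new/datehour=2014010100/ is now queryable
SELECT id, text FROM tweets WHERE datehour = 2014010100 LIMIT 10;

-- Alternatively, if many datehour=<value> directories already exist under
-- the table location, MSCK REPAIR TABLE can register them all in one step
MSCK REPAIR TABLE tweets;
```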