几种保存Hive查询结果的方法

很多时候,我们需要将Hive的查询(select)结果保存起来,方便进一步处理或查看。
在Hive里面提供了不同的方式来保存查询结果,在这里做下总结:
一、保存结果到本地

方法1:调用hive标准输出,将查询结果写到指定的文件中

这个方法最为常见,笔者也经常使用。sql的查询结果将直接保存到/tmp/out.txt中
$ hive -e "select user, login_timestamp from user_login" > /tmp/out.txt
当然我们也可以查询保存到某个文件file.sql中,按下面的方式执行查询,并保存结果

$ hive -f file.sql > /tmp/out.txt

 下面是file.sql的内容:

$ cat file.sql

select user, login_timestamp from user_login


hive客户的的详细使用方法可以参考hive的官方文档《Hive Batch Mode Commands》

    hive -e '<query-string>' executes the query string.
    hive -f <filepath> executes one or more SQL queries from a file.

方法2:使用INSERT OVERWRITE LOCAL DIRECTORY结果到本地

    hive> insert overwrite local directory "/tmp/out/"                                        
        > select user, login_time from user_login;

上面的命令会将select user, login_time from user_login的查询结果保存到/tmp/out/本地目录下。


我们查看一下/tmp/out/目录下的文件,发现命令执行后,多了两个文件:

$ find /tmp/out/ -type f
/tmp/out/.000000_0.crc
/tmp/out/000000_0

这两个文件存放的内容不一样,其中000000_0存放查询的结果,带有crc后缀的存放那个文件的crc32校验。

用vim打开查看下000000_0的内容:

vim /tmp/out/000000_0
 1 user_1^A20140701
 2 user_2^A20140701
 3 user_2^A20140701

可以看到,导出的查询结果字段之间是用^A(Ctrl+A)作为分割符,行与行之间用\n作为分割。


默认的字段分割符有时候可能不太方便,幸好Hive提供了修改分割符号的方法,我们只要在导出时指定就可以了:

    hive> insert overwrite local directory "/tmp/out/"
        > row format delimited fields terminated by "\t" 
        > select user, login_time from user_login;

$ vim /tmp/out/000000_0
1 user_1 20140701
2 user_2 20140701
3 user_2 20140701

可以看到字段分割符已经变成了tab(人眼看起来更舒服^-^)。


同样,我们也可以指定复杂类型(collection、map)的输出格式

更多关于INSERT OVERWRITE LOCAL DIRECTORY的语法,可以参考HIVE的官方文档《Writing data into the filesystem from queries》。

    Standard syntax:
    INSERT OVERWRITE <span style="color:#ff0000;">[LOCAL] </span>DIRECTORY directory1
      [ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
      SELECT ... FROM ...
     
    Hive extension (multiple inserts):
    FROM from_statement
    INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
    [INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
     
     
    row_format
      : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
            [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
            [NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)

二、保存结果到hdfs

保存查询结果到hdfs很简单,使用INSERT OVERWRITE DIRECTORY就可以完成操作:

    hive> insert overwrite directory "/tmp/out/"
        > row format delimited fields terminated by "\t" 
        > select user, login_time from user_login;

需要注意的是,跟保存到本地文件系统的差别是,保存到hdfs时命令不需要指定LOCAL项

更多关于INSERT OVERWRITE DIRECTORY的语法,可以参考HIVE的官方文档《Writing data into the filesystem from queries》。

三、保存结果到HIVE表
方法1、已经建好结果表,使用INSERT OVERWRITE TABLE以覆盖方式写入结果表

如果结果表已经建好,可以使用INSERT OVERWRITE TABLE将结果写入结果表:

    hive> desc query_result;
    OK
    user                	string,

    login_time          	bigint
    hive> insert overwrite table query_result     
        > select user, login_time from user_login;

    hive> select * from query_result;             
    OK
    user_1	20140701
    user_2	20140701
    user_3	20140701


当然,HIVE也提供了追加方式INSERT TABLE,可以在原有数据后面加上新的查询结果。在上面这个例子基础上,我们再追加一个查询结果:

    hive> insert into table query_result
        > select * from query_result;

    hive> select * from query_result;
    OK
    user_1	20140701
    user_2	20140701
    user_3	20140701
    <span style="color:#ff0000;">user_1	20140701
    user_2	20140701
    user_3	20140701</span>

注意标红的部分,使用INSERT TABLE后,query_result增加了三行数据

更多关于INSERT OVERWRITE TABLE的语法,可以参考HIVE官方文档《Inserting data into Hive Tables from queries》

    Standard syntax:
    INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
    INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
     
    Hive extension (multiple inserts):
    FROM from_statement
    INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
    [INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
    [INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
    FROM from_statement
    INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
    [INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
    [INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;
     
    Hive extension (dynamic partition inserts):
    INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
    INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;

方法2、如果需要新建一个表,用于存放查询结果,可以使用CREATE TABLE AS SELECT语法

    hive> create table query_result 
        > as
        > select user, login_time from user_login;

    hive> select * from query_result;            
    OK
    user_1	20140701
    user_2	20140701
    user_3	20140701

更多关于CREATE TABLE AS SELECT的语法,可以参考HIVE官方文档《Create Table As Select (CTAS)》

    CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name   (Note: TEMPORARY available starting with Hive 0.14.0)
      [(col_name data_type [COMMENT col_comment], ...)]
      [COMMENT table_comment]
      [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
      [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
      [SKEWED BY (col_name, col_name, ...) ON ([(col_value, col_value, ...), ...|col_value, col_value, ...])
                                          [STORED AS DIRECTORIES] (Note: Only available starting with Hive 0.10.0)]
      [
       [ROW FORMAT row_format] [STORED AS file_format]
       | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  (Note: Only available starting with Hive 0.6.0)
      ]
      [LOCATION hdfs_path]
      [TBLPROPERTIES (property_name=property_value, ...)]  (Note: Only available starting with Hive 0.6.0)
      [AS select_statement]  (Note: Only available starting with Hive 0.5.0, and not supported when creating external tables.)
     
    CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
      LIKE existing_table_or_view_name
      [LOCATION hdfs_path]
     
    data_type
      : primitive_type
      | array_type
      | map_type
      | struct_type
      | union_type (Note: Only available starting with Hive 0.7.0)
     
    primitive_type
      : TINYINT
      | SMALLINT
      | INT
      | BIGINT
      | BOOLEAN
      | FLOAT
      | DOUBLE
      | STRING
      | BINARY (Note: Only available starting with Hive 0.8.0)
      | TIMESTAMP (Note: Only available starting with Hive 0.8.0)
      | DECIMAL (Note: Only available starting with Hive 0.11.0)
      | DECIMAL(precision, scale) (Note: Only available starting with Hive 0.13.0)
      | VARCHAR (Note: Only available starting with Hive 0.12.0)
      | CHAR (Note: Only available starting with Hive 0.13.0)
     
    array_type
      : ARRAY < data_type >
     
    map_type
      : MAP < primitive_type, data_type >
     
    struct_type
      : STRUCT < col_name : data_type [COMMENT col_comment], ...>
     
    union_type
       : UNIONTYPE < data_type, data_type, ... > (Note: Only available starting with Hive 0.7.0)
     
    row_format
      : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
            [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
            [NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)
      | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
     
    file_format:
      : SEQUENCEFILE
      | TEXTFILE
      | RCFILE     (Note: Only available starting with Hive 0.6.0)
      | ORC        (Note: Only available starting with Hive 0.11.0)
      | AVRO       (Note: Only available starting with Hive 0.14.0)
      | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

四、使用hdfs直接导出表

Hive是构建在hdfs上的,因此,我们可以使用hdfs的命令hadoop dfs -get直接导出表。

首先、我们先找到要导出的表存放到哪个目录下:

    hive> show create table user_login;
    OK
    CREATE  TABLE `user_login`(
      `user` string, 
      `login_time` bigint)
    ROW FORMAT SERDE 
      'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    <span style="color:#ff0000;">LOCATION
      'file:/user/hive/warehouse/test.db/user_login'</span>
    TBLPROPERTIES (
      'totalSize'='160', 
      'numRows'='10', 
      'rawDataSize'='150', 
      'COLUMN_STATS_ACCURATE'='true', 
      'numFiles'='1', 
      'transient_lastDdlTime'='1411544983')
    Time taken: 0.174 seconds, Fetched: 18 row(s)

可以看到,user_login表存放到在file:/user/hive/warehouse/test.db/user_login。

接下来,直接利用hadoop dfs -get导出到本地:

hadoop dfs -get file:/user/hive/warehouse/test.db/user_login  /tmp/out/

更多关于hadoop  dfs -get命令,可以参考hadoop dfs命令界面文档《File System Shell》

作者:手软脚软
来源:CSDN
原文:https://blog.csdn.net/zhuce1986/article/details/39586189
版权声明:本文为博主原创文章,转载请附上博文链接!

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值