Hive File Formats


Hive has four file formats: TextFile, SequenceFile, RCFile, and ORC.


TextFile

The default format: plain text, one record per line.
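Because TextFile is the default, stored as textfile can be omitted when creating a table, and plain-text files can be loaded directly with LOAD DATA. A minimal sketch (the table name t_text is illustrative; the data file is the /data/test_data file prepared further below):

hive (zmgdb)> create table t_text(str string) stored as textfile;
hive (zmgdb)> load data local inpath '/data/test_data' into table t_text;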

SequenceFile

Overview

A SequenceFile is Hadoop's flat file format for binary key/value pairs. For details, see: http://blog.csdn.net/zengmingen/article/details/52242768

Usage

hive (zmgdb)> create table t2(str string) stored as sequencefile;
OK
Time taken: 0.299 seconds
hive (zmgdb)> desc formatted t2;
OK
..............................
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Compressed: No

You cannot import data into a SequenceFile table with LOAD DATA. To see this, prepare a plain-text data file:
[root@hello110 data]# vi test_data
3
we
ew
e
re
er51
2

hive (zmgdb)> load data local inpath '/data/test_data' into table t1;
Loading data to table zmgdb.t1
OK
Time taken: 1.498 seconds
hive (zmgdb)> load data local inpath '/data/test_data' into table t2;
FAILED: SemanticException Unable to load data to destination table. Error: The file that you are trying to load does not match the file format of the destination table.
Instead, use INSERT OVERWRITE TABLE t2 SELECT * FROM t1;, which launches a MapReduce job to convert and write the data:
hive (zmgdb)> insert overwrite table t2 select * from t1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20160914215205_992081a3-1783-4052-8da8-53e6097a2775
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1473855624724_0001, Tracking URL = http://hello110:8088/proxy/application_1473855624724_0001/
Kill Command = /home/hadoop/app/hadoop-2.7.2/bin/hadoop job -kill job_1473855624724_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-14 21:52:22,073 Stage-1 map = 0%, reduce = 0%
2016-09-14 21:52:43,733 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.9 sec
MapReduce Total cumulative CPU time: 2 seconds 900 msec
Ended Job = job_1473855624724_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://hello110:9000/user/hive/warehouse/zmgdb.db/t2/.hive-staging_hive_2016-09-14_21-52-05_274_2207100662758769951-1/-ext-10000
Loading data to table zmgdb.t2
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.9 sec HDFS Read: 3844 HDFS Write: 1534 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 900 msec
OK
t1.str
Time taken: 43.709 seconds
hive (zmgdb)> select * from t2;
OK
t2.str
1
2
2
43
4
dds
ads
fdsdsf
fds
ad
If you look at the raw file in HDFS, you can see that a SequenceFile is stored on disk in a binary format rather than as readable text.
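A quick way to peek at the file from within the Hive CLI (a sketch: the warehouse path matches this example's t2 table, but the data file name 000000_0 is typical rather than guaranteed):

hive (zmgdb)> dfs -ls /user/hive/warehouse/zmgdb.db/t2;
hive (zmgdb)> dfs -cat /user/hive/warehouse/zmgdb.db/t2/000000_0;

The output begins with the magic bytes SEQ, followed by the serialized key/value class names and binary row data.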


RCFile

RCFile (Record Columnar File) combines row and column storage. First, data is split horizontally into row groups, guaranteeing that all fields of a record live in the same block, so reading one record never requires reading multiple blocks. Second, within each row group the data is laid out column by column, which improves compression and makes reading a subset of columns fast.


hive (zmgdb)> create table rc_t1(id string) stored as rcfile;
OK
Time taken: 0.334 seconds

hive (zmgdb)> desc formatted rc_t1;
OK
col_name        data_type       comment
# col_name              data_type               comment             
                 
id                      string                                      
                 
# Detailed Table Information             
Database:               zmgdb                    
Owner:                  hadoop                   
CreateTime:             Fri Sep 23 19:21:15 CST 2016     
LastAccessTime:         UNKNOWN                  
Retention:              0                        
Location:               hdfs://hello110:9000/user/hive/warehouse/zmgdb.db/rc_t1  
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\"}
        numFiles                0                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               0                   
        transient_lastDdlTime   1474629675          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe   
InputFormat:            org.apache.hadoop.hive.ql.io.RCFileInputFormat   
OutputFormat:           org.apache.hadoop.hive.ql.io.RCFileOutputFormat  

Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        serialization.format    1                   
Time taken: 0.135 seconds, Fetched: 30 row(s)
hive (zmgdb)> insert overwrite table rc_t1 select * from t2;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20160923192210_96320492-f8bf-483a-83c4-b9874fd05ef4
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1474629517907_0001, Tracking URL = http://hello110:8088/proxy/application_1474629517907_0001/
Kill Command = /home/hadoop/app/hadoop-2.7.2/bin/hadoop job  -kill job_1474629517907_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-09-23 19:22:22,091 Stage-1 map = 0%,  reduce = 0%
2016-09-23 19:22:28,446 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.83 sec
MapReduce Total cumulative CPU time: 1 seconds 830 msec
Ended Job = job_1474629517907_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to directory hdfs://hello110:9000/user/hive/warehouse/zmgdb.db/rc_t1/.hive-staging_hive_2016-09-23_19-22-10_649_8279187505632970863-1/-ext-10000
Loading data to table zmgdb.rc_t1
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 1.83 sec   HDFS Read: 4755 HDFS Write: 876 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 830 msec
OK
t2.id
Time taken: 19.126 seconds


hive (zmgdb)> select * from rc_t1;
OK
rc_t1.id
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17




ORC

ORC (Optimized Row Columnar) is an optimization of RCFile, with built-in compression and lightweight indexes.
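Creating and populating an ORC table follows the same pattern as the RCFile example above. A sketch (the table name orc_t1 is illustrative, and the explicit ZLIB setting matches the default; orc.compress also accepts NONE and SNAPPY):

hive (zmgdb)> create table orc_t1(id string) stored as orc tblproperties ("orc.compress"="ZLIB");
hive (zmgdb)> insert overwrite table orc_t1 select * from rc_t1;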


Storage Summary

TextFile consumes the most storage space, and compressed text files cannot be split or merged, so query efficiency is the lowest. However, files can be stored as-is, so loading is the fastest.

SequenceFile also consumes considerable storage space, but compressed SequenceFiles can be split and merged, so query efficiency is higher. Data must be loaded by converting from a text-format table.

RCFile uses the least storage space and has the highest query efficiency, but data must be loaded by converting from a text-format table, so loading is the slowest.
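To check these storage claims against your own data, compare the totalSize value shown by desc formatted under Table Parameters, or the file sizes in HDFS, e.g. for the tables used in this post:

hive (zmgdb)> dfs -du -h /user/hive/warehouse/zmgdb.db/t1;
hive (zmgdb)> dfs -du -h /user/hive/warehouse/zmgdb.db/t2;
hive (zmgdb)> dfs -du -h /user/hive/warehouse/zmgdb.db/rc_t1;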




