Combining storage formats with compression

Official documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

Compression settings for the ORC storage format:

| Key | Default | Notes |
|---|---|---|
| orc.compress | ZLIB | high level compression (one of NONE, ZLIB, SNAPPY) |
| orc.compress.size | 262,144 | number of bytes in each compression chunk |
| orc.stripe.size | 67,108,864 | number of bytes in each stripe |
| orc.row.index.stride | 10,000 | number of rows between index entries (must be >= 1000) |
| orc.create.index | true | whether to create row indexes |
| orc.bloom.filter.columns | "" | comma separated list of column names for which bloom filter should be created |
| orc.bloom.filter.fpp | 0.05 | false positive probability for bloom filter (must be > 0.0 and < 1.0) |
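As a rough illustration of how these keys are used, several of them can be set together in tblproperties when a table is created. This is only a sketch: the table name log_orc_tuned and the choice of session_id as the Bloom filter column are assumptions made for the example, and the numeric values simply restate the defaults listed above.

```sql
-- Hypothetical table: the name and the Bloom filter column are illustrative only.
create table log_orc_tuned(
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
STORED AS orc tblproperties (
  "orc.compress"="SNAPPY",                   -- one of NONE, ZLIB, SNAPPY
  "orc.compress.size"="262144",              -- bytes per compression chunk
  "orc.stripe.size"="67108864",              -- bytes per stripe
  "orc.row.index.stride"="10000",            -- rows between index entries
  "orc.create.index"="true",                 -- build row indexes
  "orc.bloom.filter.columns"="session_id",   -- assumed column choice
  "orc.bloom.filter.fpp"="0.05"              -- Bloom filter false positive probability
);
```

Property values are passed as strings, so the numbers are written without thousands separators.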

1. Create an uncompressed ORC table

(1) Table creation statement

create table log_orc_none(
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="NONE");

(2) Insert data

insert into table log_orc_none select * from log_text1;

(3) Check the data size after the insert

dfs -du -h /user/hive/warehouse/myhive.db/log_orc_none;

7.7 M  /user/hive/warehouse/log_orc_none/123456_0

2. Create an ORC table with SNAPPY compression

(1) Table creation statement

create table log_orc_snappy(
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");

(2) Insert data

insert into table log_orc_snappy select * from log_text1;

(3) Check the data size after the insert

dfs -du -h /user/hive/warehouse/myhive.db/log_orc_snappy;

3.8 M  /user/hive/warehouse/log_orc_snappy/123456_0

3. The ORC table created with default settings in the previous post had the following size after the data was loaded:

2.8 M  /user/hive/warehouse/log_orc/123456_0

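This is even smaller than the Snappy-compressed table. The reason is that ORC files use ZLIB compression by default, and ZLIB achieves a higher compression ratio than Snappy.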
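One way to check which codec applies to a table is to inspect its properties. The statements below are a sketch that assumes the tables above exist in the current database and that log_orc was created without an explicit orc.compress property. hive.exec.orc.default.compress is the Hive setting that supplies the codec in that case; its value on a given cluster may have been overridden.

```sql
-- Prints the orc.compress value declared on the table.
SHOW TBLPROPERTIES log_orc_snappy("orc.compress");

-- Lists every property of the default table; orc.compress is assumed not to be set here.
SHOW TBLPROPERTIES log_orc;

-- Prints the session's default ORC codec (ZLIB unless it has been overridden).
SET hive.exec.orc.default.compress;
```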

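4. Summary of storage formats and compression:

In real-world projects, Hive table data is usually stored as ORC or Parquet, with Snappy as the compression codec. Snappy produces larger files than ZLIB, as the sizes above show, but it compresses and decompresses faster.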
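For the Parquet case, a comparable table could be declared along the following lines. This is a sketch only: the table name log_parquet_snappy is hypothetical, the columns and the source table log_text1 are carried over from the examples above, and no file size is shown because this variant was not measured here.

```sql
-- Hypothetical table name; columns mirror the ORC examples above.
create table log_parquet_snappy(
  track_time string,
  url string,
  session_id string,
  referer string,
  ip string,
  end_user_id string,
  city_id string
)
STORED AS parquet tblproperties ("parquet.compression"="SNAPPY");

-- Load the same source data so the result can be compared with the ORC tables.
insert into table log_parquet_snappy select * from log_text1;
```

Running dfs -du -h on the new table's warehouse directory afterwards gives a size that can be compared with the three ORC results.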
