In the previous article on Hive compression, I compressed data with BZip2, Snappy, and other codecs. BZip2 clearly gave the best compression ratio, but the other codecs each have strengths that suit different workloads. All of those tests, however, used the Textfile storage format. You can check a table's file format with desc formatted tablename:
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
The InputFormat/OutputFormat lines show that the table is stored as Textfile. In my view, only the right file format combined with the right compression codec gives the best result, i.e.: Storage Format + Compression.
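As a quick sketch of that combination, format and codec can be chosen together at table creation (the table name here is hypothetical; ZLIB and SNAPPY are the codecs ORC accepts via orc.compress, with ZLIB being ORC's default):

```sql
-- Sketch: choose the storage format and the codec together.
-- ORC compresses with ZLIB by default; SNAPPY trades ratio for speed.
CREATE TABLE page_views_orc_snappy
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
AS SELECT * FROM page_views;
```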
The SEQUENCEFILE format
- Architecture diagram
The layout exposes the problem: every record carries a record length and a key length field, which wastes a lot of space. Converting a Textfile table to SequenceFile therefore makes it larger, so plain SequenceFile is not recommended.
- Test:
create table page_views_seq(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
STORED AS SEQUENCEFILE;
insert into table page_views_seq select * from page_views;
hadoop:hadoop:/home/hadoop:>hadoop fs -du -h /user/hive/warehouse
18.1 M 18.1 M /user/hive/warehouse/page_views
19.1 M 19.1 M /user/hive/warehouse/page_views_seq
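If SequenceFile must be used, the per-record overhead can be offset by enabling block compression before loading the data. A sketch, assuming standard Hadoop settings (io.seqfile.compression.type is a stock Hadoop property; the codec in effect is whatever the session has configured):

```sql
-- Sketch: write a block-compressed SequenceFile table.
-- With BLOCK compression many records share one compressed block,
-- amortizing the record-length and key-length fields.
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
INSERT OVERWRITE TABLE page_views_seq SELECT * FROM page_views;
```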
The RCFILE format
- Architecture diagram
This layout first partitions the data row-wise into row groups, then stores each row group column-wise. That gives it a clear size advantage over SequenceFile; against Textfile the gain is modest, as the test below shows.
- Test
create table page_views_rc3(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
STORED AS rcfile;
insert into table page_views_rc3 select * from page_views;
hadoop:hadoop:/home/hadoop:>hadoop fs -du -h /user/hive/warehouse
0 0 /user/hive/warehouse/.hive-staging_hive_2018-03-04_19-48-25_047_3546732262703318818-1
18.1 M 18.1 M /user/hive/warehouse/page_views
17.9 M 17.9 M /user/hive/warehouse/page_views_rc3
The ORCFILE format
ORC is an optimization of RCFile. It splits a file into stripes (250 MB each by default in the original spec) and, within each stripe, keeps an index entry (index data) for every 10,000 rows by default. Each index entry records statistics such as the min and max of those rows, which lets queries skip whole row groups, a much clearer optimization at read time.
- Test
create table page_views_orc
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
stored as orc
tblproperties("orc.compress"="NONE")
as select * from page_views;
hadoop:hadoop:/home/hadoop:>hadoop fs -du -h /user/hive/warehouse
0 0 /user/hive/warehouse/.hive-staging_hive_2018-03-04_19-48-25_047_3546732262703318818-1
18.1 M 18.1 M /user/hive/warehouse/page_views
7.7 M 7.7 M /user/hive/warehouse/page_views_orc
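The stripe size and index stride mentioned above can be tuned per table through ORC's documented table properties. A sketch (the table name and the specific values are illustrative, not recommendations):

```sql
-- Sketch: tune ORC layout at table creation.
CREATE TABLE page_views_orc_tuned
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"="ZLIB",          -- codec: NONE, ZLIB (default), or SNAPPY
  "orc.stripe.size"="67108864",   -- 64 MB stripes, illustrative value
  "orc.row.index.stride"="10000"  -- index entry every 10,000 rows (the default)
)
AS SELECT * FROM page_views;
```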
The PARQUET format
- Test
create table page_views_parquet
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
stored as parquet
as select * from page_views;
hadoop:hadoop:/home/hadoop:>hadoop fs -du -h /user/hive/warehouse
0 0 /user/hive/warehouse/.hive-staging_hive_2018-03-04_19-48-25_047_3546732262703318818-1
13.1 M 13.1 M /user/hive/warehouse/page_views_parquet
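Parquet's codec can likewise be chosen per table via the parquet.compression property. A sketch (SNAPPY is illustrative; on older Hive versions this may need to be a session setting, SET parquet.compression=SNAPPY, rather than a table property):

```sql
-- Sketch: Parquet with an explicit codec.
CREATE TABLE page_views_parquet_snappy
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="SNAPPY")
AS SELECT * FROM page_views;
```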
Comparing HDFS reads
1.select count(1) from page_views where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';
HDFS Read: 19022674
2.select count(1) from page_views_seq where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';
HDFS Read: 20080823
3.select count(1) from page_views_rc3 where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';
HDFS Read: 3725340
4.select count(1) from page_views_orc where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';
HDFS Read: 1810060
Summary: for the same query, the columnar formats read far less from HDFS — RCFile reads about 80% less than Textfile here and ORC about 90% less, while SequenceFile actually reads slightly more — and query speed improves accordingly.
LZO+Storage Format
rcfile:
-- assuming the LZO settings from the earlier compression article are in effect, e.g.:
-- SET hive.exec.compress.output=true;
-- SET mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
create table page_views_lzo5_rc
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
STORED AS rcfile
as select * from page_views;
parquet:
create table page_views_lzo5_parquet
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
stored as parquet
as select * from page_views;
hadoop:hadoop:/home/hadoop:>hadoop fs -du -h /user/hive/warehouse
0 0 /user/hive/warehouse/.hive-staging_hive_2018-03-04_19-48-25_047_3546732262703318818-1
18.1 M 18.1 M /user/hive/warehouse/page_views
6.2 M 6.2 M /user/hive/warehouse/page_views_lzo5_parquet
7.8 M 7.8 M /user/hive/warehouse/page_views_lzo5_rc