Hive tables support four main storage formats:
Row-oriented: TextFile, SequenceFile
Column-oriented: ORC, Parquet
I. Comparing the compression ratio of the four storage formats
The test data is log.data, 18.1 MB on disk.
1. TextFile format
TextFile is Hive's default table storage format. The data is stored uncompressed, so it costs more disk space and is more expensive to parse.
(1) Create a table stored as textfile
create table log_text(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as textfile;
(2) Load the data
load data local inpath '/opt/module/datas/log.data' into table log_text;
(3) Check the table size on HDFS
dfs -du -h /user/hive/warehouse/hive_db1.db/log_text;
18.1 M /user/hive/warehouse/hive_db1.db/log_text/log.data
TextFile applies no compression, so the table is still 18.1 MB.
2. SequenceFile format
(1) Create a table stored as sequencefile
create table log_seq(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as sequencefile;
(2) Insert the data
insert into table log_seq select * from log_text;
(3) Check the table size on HDFS
dfs -du -h /user/hive/warehouse/hive_db1.db/log_seq;
19.6 M /user/hive/warehouse/hive_db1.db/log_seq/000000_0
Stored as SequenceFile, the 18.1 MB of data grew to 19.6 MB: SequenceFile applies no compression by default and adds per-record headers and sync markers, so it is clearly a poor fit here.
3. ORC format
(1) Create a table stored as orc
create table log_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc;
(2) Insert the data
insert into table log_orc select * from log_text;
(3) Check the table size on HDFS
dfs -du -h /user/hive/warehouse/hive_db1.db/log_orc;
2.8 M /user/hive/warehouse/hive_db1.db/log_orc/000000_0
Stored as ORC, the 18.1 MB of data shrank to 2.8 MB.
4. Parquet format
(1) Create a table stored as parquet
create table log_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as parquet;
(2) Insert the data
insert into table log_parquet select * from log_text;
(3) Check the table size on HDFS
dfs -du -h /user/hive/warehouse/hive_db1.db/log_parquet;
13.1 M /user/hive/warehouse/hive_db1.db/log_parquet/000000_0
Stored as Parquet, the 18.1 MB of data shrank to 13.1 MB.
5. Compression effectiveness of the four formats, best to worst:
orc > parquet > textfile > sequencefile
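The ranking above can be made concrete by computing each format's compression ratio from the measured sizes (pure arithmetic over the dfs -du figures reported above):

```python
# On-disk sizes (MB) measured above with "dfs -du -h".
original_mb = 18.1
sizes_mb = {
    "textfile": 18.1,
    "sequencefile": 19.6,
    "orc": 2.8,
    "parquet": 13.1,
}

# Ratio of source size to stored size: higher means better compression.
for fmt, size in sorted(sizes_mb.items(), key=lambda kv: kv[1]):
    print(f"{fmt:>12}: {size:5.1f} MB  ratio {original_mb / size:.2f}x")
```

ORC comes out roughly 6.5x smaller than the raw text, Parquet about 1.4x, while SequenceFile's ratio is below 1, i.e. it is larger than the source.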
II. Comparing the query speed of the four formats
1. TextFile format
hive (hive_db1)> select count(*) from log_text;
_c0
100000
Time taken: 23.514 seconds, Fetched: 1 row(s)
2. SequenceFile format
hive (hive_db1)> select count(*) from log_seq;
_c0
100000
Time taken: 19.143 seconds, Fetched: 1 row(s)
3. ORC format
hive (hive_db1)> select count(*) from log_orc;
_c0
100000
Time taken: 18.239 seconds, Fetched: 1 row(s)
4. Parquet format
hive (hive_db1)> select count(*) from log_parquet;
_c0
100000
Time taken: 19.018 seconds, Fetched: 1 row(s)
Conclusion: for the same count over 100,000 rows, the four formats query at similar speeds; ORC was the fastest and TextFile the slowest.
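To quantify "similar speeds", the four timings can be compared directly (numbers copied from the CLI output above; each is a single run, so the spread is only indicative):

```python
# "select count(*)" wall-clock times (seconds) reported by the Hive CLI above.
times_s = {
    "textfile": 23.514,
    "sequencefile": 19.143,
    "orc": 18.239,
    "parquet": 19.018,
}

fastest = min(times_s, key=times_s.get)
slowest = max(times_s, key=times_s.get)
spread = times_s[slowest] / times_s[fastest] - 1  # relative gap
print(f"fastest: {fastest}, slowest: {slowest}, gap: {spread:.0%}")
```

The gap between the fastest (ORC) and slowest (TextFile) is under 30%, and the three non-text formats are within about one second of each other.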
III. Comparing file sizes under the NONE, SNAPPY, and ZLIB codecs
Using ORC, the format with the best compression ratio above, as the test case. The codec is set via the table property:
orc.compress = NONE | ZLIB | SNAPPY
1. Codec set to NONE
(1) Create the table
create table log_orc_none(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties("orc.compress"="NONE");
(2) Insert the data
insert into table log_orc_none select * from log_text;
(3) Check the table size
dfs -du -h /user/hive/warehouse/hive_db1.db/log_orc_none;
7.7 M /user/hive/warehouse/hive_db1.db/log_orc_none/000000_0
Even with compression disabled, the table is 7.7 MB rather than 18.1 MB: ORC's columnar encodings (dictionary and run-length encoding) already shrink the data before any codec is applied.
2. SNAPPY vs. the default ZLIB
Repeating the same steps with "orc.compress"="SNAPPY", the table comes to 3.8 MB. The original log_orc table uses ORC's default codec, ZLIB (based on the DEFLATE algorithm), and came to 2.8 MB, smaller than the SNAPPY result; ZLIB trades extra CPU time for the better compression ratio.
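Putting the three ORC codec results side by side (sizes from the tests above) shows the saving each codec achieves over the uncompressed file:

```python
# ORC table sizes (MB) under the three orc.compress settings tested above.
orc_sizes_mb = {"NONE": 7.7, "SNAPPY": 3.8, "ZLIB": 2.8}  # ZLIB is the default

for codec, size in orc_sizes_mb.items():
    saving = 1 - size / orc_sizes_mb["NONE"]
    print(f"{codec:>6}: {size:.1f} MB  ({saving:.0%} smaller than NONE)")
```

In practice the choice between ZLIB and SNAPPY is a CPU-versus-size trade-off: ZLIB compresses harder, SNAPPY compresses and decompresses faster, so query-heavy workloads often prefer SNAPPY despite the larger files.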