压缩:Compression
需要考虑的要素:
一:压缩比
二:解压速度
常见压缩格式对比:
压缩在MapReduce里的应用:
hadoop checknative命令查看支持的压缩格式,需要自行编译以支持各种格式
codec:我们只需要配置在hadoop的配置文件中即可
压缩的使用
core-site.xml
//需要哪些格式加什么的代码,这里只加三种
<property>
<name>io.compression.codecs</name>
<value>
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec, //常用
com.hadoop.compression.lzo.LzopCodec, //常用
org.apache.hadoop.io.compress.Lz4Codec,
org.apache.hadoop.io.compress.Snappycodec,
</value>
</property>
mapred-site.xml
<property>
<name>mapreduce.output.fileoutputformat.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.output.fileoutputformat.compress.codec</name>
<value>org.apache.hadoop.io.compress.BZip2Codec</value>//指定使用BZ2压缩格式
</property>
实验:创建第一个表,用于给后续的表插入数据
create table ruoze_page_views(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t';
load data local inpath '/home/hadoop/data/page_views.dat' overwrite into table ruoze_page_views;
BZ2示例:Hive的结果以BZ2的格式压缩输出:
开启Hive的压缩输出,并指定为BZ2的格式
SET hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
在Hive创建一个Hive的结果以BZ2的格式压缩输出表
create table ruoze_page_views_bzip2
row format delimited fields terminated by '\t'
as select * from ruoze_page_views;
关闭压缩功能
set hive.exec.compress.output=false;
-----------------------
行式存储 vs 列式存储
行式存储的优点:
同一行数据存放在同一个block块里面,select * from table_name;数据能直接获取出来;
INSERT/UPDATE比较方便
行式存储的缺点:
不同类型数据存放在同一个block块里面,压缩性能不好;
select id,name from table_name;这种类型的列查询,所有数据都要读取,而不能跳过。
列式存储的优点:
同类型数据存放在同一个block块里面,压缩性能好;
任何列都能作为索引。
列式存储的缺点:
select * from table_name;这类全表查询,需要数据重组;
INSERT/UPDATE比较麻烦。
Storage Format 介绍:
STORED AS file_format
TextFile示例:
create table ruoze_b(id int) stored as
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
SequenceFile示例
create table ruoze_page_views_seq(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as SEQUENCEFILE;
insert into table ruoze_page_views_seq select * from ruoze_page_views;
RcFile示例:
create table ruoze_page_views_rc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as rcfile;
insert into table ruoze_page_views_rc select * from ruoze_page_views;
Orc示例:
create table ruoze_page_views_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc;
insert into table ruoze_page_views_orc select * from ruoze_page_views;
不使用orc压缩功能:
create table ruoze_page_views_orc_null(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress"="NONE");
insert into table ruoze_page_views_orc_null select * from ruoze_page_views;
parquet示例:
create table ruoze_page_views_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as parquet;
insert into table ruoze_page_views_parquet select * from ruoze_page_views;
GZIP示例:
set parquet.compression=GZIP;
create table ruoze_page_views_parquet_gzip
row format delimited fields terminated by '\t'
stored as parquet
as select * from ruoze_page_views;
文件大小对比:
行式储存和列式储存操作时的对比:
行式储存:
select count(1) from ruoze_page_views where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';
读取的文件数:19022752
列式储存:
orc:
select count(1) from ruoze_page_views_orc where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';
读取的文件数:1257523
parquet:
查一列:
select count(1) from ruoze_page_views_parquet where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';
读取的文件数:2687077
查两列:
select count(1) from ruoze_page_views_parquet where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1' and ip='1';
读取的文件数:3496487