Hive高级部分2

压缩:Compression
    需要考虑的要素:
    一:压缩比
    二:解压速度

常见压缩格式对比:



压缩在MapReduce里的应用:

hadoop checknative命令查看支持的压缩格式,需要自行编译以支持各种格式
codec:我们只需要配置在hadoop的配置文件中即可

压缩的使用
core-site.xml
//需要哪些格式加什么的代码,这里只加三种
<property>
    <name>io.compression.codecs</name>
    <value>
    org.apache.hadoop.io.compress.GzipCodec,
    org.apache.hadoop.io.compress.DefaultCodec,
    org.apache.hadoop.io.compress.BZip2Codec, //常用
    com.hadoop.compression.lzo.LzopCodec,       //常用
    org.apache.hadoop.io.compress.Lz4Codec,
    org.apache.hadoop.io.compress.Snappycodec,
    </value>
</property>

mapred-site.xml
<property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
</property>

<property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.BZip2Codec</value>//指定使用BZ2压缩格式
</property>

实验:创建第一个表,用于给后续的表插入数据
create table ruoze_page_views(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t';
load data local inpath '/home/hadoop/data/page_views.dat' overwrite into table  ruoze_page_views;

BZ2示例:Hive的结果以BZ2的格式压缩输出:
开启Hive的压缩输出,并指定为BZ2的格式 
SET hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
在Hive创建一个Hive的结果以BZ2的格式压缩输出表
create table ruoze_page_views_bzip2
row format delimited fields terminated by '\t'
as select * from ruoze_page_views; 
关闭压缩功能
set hive.exec.compress.output=false;
-----------------------
行式存储 vs 列式存储

行式存储的优点:

同一行数据存放在同一个block块里面,select * from table_name;数据能直接获取出来;

 INSERT/UPDATE比较方便

行式存储的缺点:

不同类型数据存放在同一个block块里面,压缩性能不好;

select id,name from table_name;这种类型的列查询,所有数据都要读取,而不能跳过。

列式存储的优点:

同类型数据存放在同一个block块里面,压缩性能好;

任何列都能作为索引。

列式存储的缺点:

select * from table_name;这类全表查询,需要数据重组;

INSERT/UPDATE比较麻烦。

Storage Format 介绍:
STORED AS file_format

TextFile示例:

create table ruoze_b(id int) stored as 
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

SequenceFile示例

create table ruoze_page_views_seq(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string

row format delimited fields terminated by '\t'
stored as SEQUENCEFILE;
insert into table ruoze_page_views_seq select * from ruoze_page_views;

RcFile示例:
create table ruoze_page_views_rc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string

row format delimited fields terminated by '\t'
stored as rcfile; 
insert into table ruoze_page_views_rc select * from ruoze_page_views;

Orc示例:
create table ruoze_page_views_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string

row format delimited fields terminated by '\t'
stored as orc; 
insert into table ruoze_page_views_orc select * from ruoze_page_views;

不使用orc压缩功能:
create table ruoze_page_views_orc_null(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string

row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress"="NONE"); 
insert into table ruoze_page_views_orc_null select * from ruoze_page_views;


parquet示例:

create table ruoze_page_views_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string

row format delimited fields terminated by '\t'
stored as parquet; 
insert into table ruoze_page_views_parquet select * from ruoze_page_views;

GZIP示例:
set parquet.compression=GZIP;
create table ruoze_page_views_parquet_gzip 
row format delimited fields terminated by '\t'
stored as parquet
as select * from ruoze_page_views;

文件大小对比:

行式储存和列式储存操作时的对比:
行式储存:
select count(1) from ruoze_page_views where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';
读取的文件数:19022752

列式储存:
orc:
select count(1) from ruoze_page_views_orc where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';
读取的文件数:1257523
parquet:
查一列:
select count(1) from ruoze_page_views_parquet where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';
读取的文件数:2687077
查两列:
select count(1) from ruoze_page_views_parquet where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1' and ip='1';
读取的文件数:3496487

阅读更多

没有更多推荐了,返回首页