Hive高级部分2

最新推荐文章于 2024-05-14 15:23:36 发布

烙痕

最新推荐文章于 2024-05-14 15:23:36 发布

阅读量422

点赞数

分类专栏： Hive

本文链接：https://blog.csdn.net/qq_37408712/article/details/81592907

版权

Hive 专栏收录该内容

1 篇文章 0 订阅

订阅专栏

压缩：Compression
需要考虑的要素：
一：压缩比
二：解压速度

常见压缩格式对比：

压缩在MapReduce里的应用：

hadoop checknative命令查看支持的压缩格式，需要自行编译以支持各种格式
codec：我们只需要配置在hadoop的配置文件中即可

压缩的使用
core-site.xml
//需要哪些格式加什么的代码，这里只加三种
<property>
<name>io.compression.codecs</name>
   <value>
   org.apache.hadoop.io.compress.GzipCodec,
   org.apache.hadoop.io.compress.DefaultCodec,
   org.apache.hadoop.io.compress.BZip2Codec, //常用
com.hadoop.compression.lzo.LzopCodec, //常用
  org.apache.hadoop.io.compress.Lz4Codec,
  org.apache.hadoop.io.compress.Snappycodec,
</value>
</property>

mapred-site.xml
<property>
<name>mapreduce.output.fileoutputformat.compress</name>
<value>true</value>
</property>

<property>
<name>mapreduce.output.fileoutputformat.compress.codec</name>
<value>org.apache.hadoop.io.compress.BZip2Codec</value>//指定使用BZ2压缩格式
</property>

实验：创建第一个表，用于给后续的表插入数据
create table ruoze_page_views(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t';
load data local inpath '/home/hadoop/data/page_views.dat' overwrite into table ruoze_page_views;

BZ2示例：Hive的结果以BZ2的格式压缩输出：
开启Hive的压缩输出，并指定为BZ2的格式
SET hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;
在Hive创建一个Hive的结果以BZ2的格式压缩输出表
create table ruoze_page_views_bzip2
row format delimited fields terminated by '\t'
as select * from ruoze_page_views;
关闭压缩功能
set hive.exec.compress.output=false;
-----------------------
行式存储 vs 列式存储

行式存储的优点：

同一行数据存放在同一个block块里面，select * from table_name;数据能直接获取出来；

INSERT/UPDATE比较方便

行式存储的缺点：

不同类型数据存放在同一个block块里面，压缩性能不好；

select id,name from table_name;这种类型的列查询，所有数据都要读取，而不能跳过。

列式存储的优点：

同类型数据存放在同一个block块里面，压缩性能好；

任何列都能作为索引。

列式存储的缺点：

select * from table_name;这类全表查询，需要数据重组；

INSERT/UPDATE比较麻烦。

Storage Format 介绍：
STORED AS file_format

TextFile示例：

create table ruoze_b(id int) stored as
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

SequenceFile示例

create table ruoze_page_views_seq(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as SEQUENCEFILE;
insert into table ruoze_page_views_seq select * from ruoze_page_views;

RcFile示例：
create table ruoze_page_views_rc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as rcfile;
insert into table ruoze_page_views_rc select * from ruoze_page_views;

Orc示例：
create table ruoze_page_views_orc(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc;
insert into table ruoze_page_views_orc select * from ruoze_page_views;

不使用orc压缩功能：
create table ruoze_page_views_orc_null(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress"="NONE");
insert into table ruoze_page_views_orc_null select * from ruoze_page_views;

parquet示例：

create table ruoze_page_views_parquet(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as parquet;
insert into table ruoze_page_views_parquet select * from ruoze_page_views;

GZIP示例：
set parquet.compression=GZIP;
create table ruoze_page_views_parquet_gzip
row format delimited fields terminated by '\t'
stored as parquet
as select * from ruoze_page_views;

文件大小对比：

行式储存和列式储存操作时的对比：
行式储存：
select count(1) from ruoze_page_views where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';
读取的文件数：19022752

列式储存：
orc：
select count(1) from ruoze_page_views_orc where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';
读取的文件数：1257523
parquet：
查一列：
select count(1) from ruoze_page_views_parquet where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1';
读取的文件数：2687077
查两列：
select count(1) from ruoze_page_views_parquet where session_id='B58W48U4WKZCJ5D1T3Z9ZY88RU7QA7B1' and ip='1';
读取的文件数：3496487