I. Purpose
1. Compress stored data to reduce storage space.
2. Hive's storage format and its compression format are two separate settings (figure omitted).
II. Implementation steps
1. Set compression parameters for Hadoop jobs
(1) Configuration parameters
(a) Permanent change: edit the configuration file mapred-site.xml, then restart Hadoop for the change to take effect (a minimal example is sketched after this list).
(b) Temporary change: pass the parameters when the job is submitted; -D specifies a runtime parameter in the form key=value.
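For the permanent approach, the sketch below shows what the mapred-site.xml entries might look like for the two map-output properties used in the command in (2); it assumes the Snappy native libraries are available on the cluster:
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>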
(2) Command used for the test
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount \
  -Dmapreduce.map.output.compress=true \
  -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  /2015082818 /output2
(3) Results
Uncompressed output: 429.49 KB; compressed output: 83.9 KB
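The output size can be checked with the HDFS du command; /output2 is the output path from the command above, and the same check can be run against the uncompressed job's output directory for comparison:
bin/hdfs dfs -du -s -h /output2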
2. Set compression parameters for Hive jobs
(1) Parameters to configure
Map side:
hive.exec.compress.intermediate enables compression of intermediate results (true enables it; the default is false, i.e. disabled):
<property>
<name>hive.exec.compress.intermediate</name>
<value>true</value>
</property>
mapreduce.map.output.compress=true
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
Reduce side:
hive.exec.compress.output enables compression of the final (reduce) output:
<property>
<name>hive.exec.compress.output</name>
<value>true</value>
</property>
mapreduce.output.fileoutputformat.compress=true
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
(2) Test
(a) First, create a source table and load the data into it:
create table file_source(
id string,
url string,
referer string,
keyword string,
type string,
guid string,
pageId string,
moduleId string,
linkId string,
attachedInfo string,
sessionId string,
trackerU string,
trackerType string,
ip string,
trackerSrc string,
cookie string,
orderCode string,
trackTime string,
endUserId string,
firstLink string,
sessionViewNo string,
productId string,
curMerchantId string,
provinceId string,
cityId string,
fee string,
edmActivity string,
edmEmail string,
edmJobId string,
ieVersion string,
platform string,
internalKeyword string,
resultSum string,
currentPage string,
linkPosition string,
buttonPosition string
)
row format delimited fields terminated by "\t";
load data local inpath '/opt/datas/2015082818' into table file_source;
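As an optional sanity check after the load, you can count the rows and look at the table directory on HDFS; the path below assumes the default database and the default Hive warehouse location:
select count(*) from file_source;
dfs -du -s -h /user/hive/warehouse/file_source;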
(b) Next, create a table that is populated with insert, without compression:
create table file_text(
id string,
url string,
referer string,
keyword string,
type string,
guid string,
pageId string,
moduleId string,
linkId string,
attachedInfo string,
sessionId string,
trackerU string,
trackerType string,
ip string,
trackerSrc string,
cookie string,
orderCode string,
trackTime string,
endUserId string,
firstLink string,
sessionViewNo string,
productId string,
curMerchantId string,
provinceId string,
cityId string,
fee string,
edmActivity string,
edmEmail string,
edmJobId string,
ieVersion string,
platform string,
internalKeyword string,
resultSum string,
currentPage string,
linkPosition string,
buttonPosition string
)
row format delimited fields terminated by "\t";
insert into table file_text select * from file_source;
(c) Set the compression parameters in the Hive session:
set hive.exec.compress.intermediate=true;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
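In the Hive CLI, running set with only a property name prints its current value, which is a quick way to confirm the settings took effect (exact output formatting may vary by version):
set hive.exec.compress.output;
set mapreduce.output.fileoutputformat.compress.codec;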
(d) Finally, create another table and run the same insert statement, now with compression enabled:
create table file_text_compress(
id string,
url string,
referer string,
keyword string,
type string,
guid string,
pageId string,
moduleId string,
linkId string,
attachedInfo string,
sessionId string,
trackerU string,
trackerType string,
ip string,
trackerSrc string,
cookie string,
orderCode string,
trackTime string,
endUserId string,
firstLink string,
sessionViewNo string,
productId string,
curMerchantId string,
provinceId string,
cityId string,
fee string,
edmActivity string,
edmEmail string,
edmJobId string,
ieVersion string,
platform string,
internalKeyword string,
resultSum string,
currentPage string,
linkPosition string,
buttonPosition string
)
row format delimited fields terminated by "\t";
insert into table file_text_compress select * from file_source;
(e) Results
Uncompressed: 27.48 MB; compressed: 14.94 MB
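These figures can be reproduced by comparing the two tables' directories on HDFS; the paths below assume the default Hive warehouse location:
bin/hdfs dfs -du -s -h /user/hive/warehouse/file_text
bin/hdfs dfs -du -s -h /user/hive/warehouse/file_text_compress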