I. Purpose
1. Compress stored data to reduce storage space.
2. Hive's storage format and its compression format are two separate settings (figure omitted).
II. Implementation steps
1. Set compression parameters for Hadoop jobs
(1) Configuration parameters
(a) Permanent change: edit the configuration file mapred-site.xml, then restart Hadoop for the change to take effect (a minimal example is sketched after this list).
(b) Temporary change: pass the parameters when the job is submitted; -D specifies a runtime parameter in the form key=value.
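For the permanent approach, the sketch below shows what the mapred-site.xml entries might look like for the two map-output properties used in the command in (2); it assumes the Snappy native libraries are available on the cluster:
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>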
(2) Command used for the test
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount \
  -Dmapreduce.map.output.compress=true \
  -Dmapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  /2015082818 /output2
(3) Results
Uncompressed output: 429.49 KB; compressed output: 83.9 KB
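The output size can be checked with the HDFS du command; /output2 is the output path from the command above, and the same check can be run against the uncompressed job's output directory for comparison:
bin/hdfs dfs -du -s -h /output2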
2. Set compression parameters for Hive jobs
(1) Parameters to configure
Map side:
hive.exec.compress.intermediate enables compression of intermediate results (true enables it; the default is false, i.e. disabled):
<property>
<name>hive.exec.compress.intermediate</name>
<value>true</value>
</property>
mapreduce.map.output.compress=true
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
Reduce side:
hive.exec.compress.output enables compression of the final (reduce) output:
<property>
<name>hive.exec.compress.output</name>
<value>true</value>
</property>
mapreduce.output.fileoutputformat.compress=true
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec
(2) Test
(a) First, create a source table and load the data into it:
create table file_source(
id string,
url string,
referer string,
keyword string,
type string,
guid string,
pageId string,
moduleId string,
linkId string,
attachedInfo string,
sessionId string,
trackerU string,
trackerType string,
ip string,
trackerSrc string,
cookie string,
orderCode string,
trackTime string,
endUserId string,
firstLink string,
sessionViewNo string,
productId string,
curMerchantId string,
provinceId string,
cityId string,
fee string,
edmActivity string,
edmEmail string,
edmJobId string,
ieVersion string,
platform string,
internalKeyword string,
resultSum string,
currentPage string,
linkPosition string,
buttonPosition string
)
row format delimited fields terminated by "\t";
load data local inpath '/opt/datas/2015082818' into table file_source;
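As an optional sanity check after the load, you can count the rows and look at the table directory on HDFS; the path below assumes the default database and the default Hive warehouse location:
select count(*) from file_source;
dfs -du -s -h /user/hive/warehouse/file_source;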
(b) Next, create a table that is populated with insert, without compression:
create table file_text(
id string,
url string,
referer string,
keyword string,
type string,
guid string,
pageId string,
moduleId string,
linkId string,
attachedInfo string,
sessionId string,
trackerU string,
trackerType string,
ip string,
trackerSrc string,
cookie string,
orderCode string,
trackTime string,
endUserId string,
firstLink string,
sessionViewNo string,
productId string,
curMerchantId string,
provinceId string,
cityId string,
fee string,
edmActivity string,
edmEmail string,
edmJobId string,
ieVersion string,
platform string,
internalKeyword string,
resultSum string,
currentPage string,
linkPosition string,
buttonPosition string
)
row format delimited fields terminated by "\t";
insert into table file_text select * from file_source;
(c) Set the compression parameters in the Hive session:
set hive.exec.compress.intermediate=true;
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
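In the Hive CLI, running set with only a property name prints its current value, which is a quick way to confirm the settings took effect (exact output formatting may vary by version):
set hive.exec.compress.output;
set mapreduce.output.fileoutputformat.compress.codec;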
(d) Finally, create another table and run the same insert statement, now with compression enabled:
create table file_text_compress(
id string,
url string,
referer string,
keyword string,
type string,
guid string,
pageId string,
moduleId string,
linkId string,
attachedInfo string,
sessionId string,
trackerU string,
trackerType string,
ip string,
trackerSrc string,
cookie string,
orderCode string,
trackTime string,
endUserId string,
firstLink string,
sessionViewNo string,
productId string,
curMerchantId string,
provinceId string,
cityId string,
fee string,
edmActivity string,
edmEmail string,
edmJobId string,
ieVersion string,
platform string,
internalKeyword string,
resultSum string,
currentPage string,
linkPosition string,
buttonPosition string
)
row format delimited fields terminated by "\t";
insert into table file_text_compress select * from file_source;
(e) Results
Uncompressed: 27.48 MB; compressed: 14.94 MB
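These figures can be reproduced by comparing the two tables' directories on HDFS; the paths below assume the default Hive warehouse location:
bin/hdfs dfs -du -s -h /user/hive/warehouse/file_text
bin/hdfs dfs -du -s -h /user/hive/warehouse/file_text_compress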