These notes cover configuring Hive compression on the CDH 6.0.1 platform.
List the compression codecs available to Hive
set io.compression.codecs;
io.compression.codecs=
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.DeflateCodec,
org.apache.hadoop.io.compress.SnappyCodec,
org.apache.hadoop.io.compress.Lz4Codec
Check Hive's default file format
set hive.default.fileformat;
hive.default.fileformat=TextFile
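# Other formats
sequencefile: splittable binary format; the compression type is NONE, RECORD, or BLOCK, and BLOCK is the usual choice
rcfile: hybrid row-columnar storage
orcfile: the successor to rcfile; recommended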
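As a minimal sketch of how the default file format interacts with an explicit storage clause, the statements below switch the session default to ORC and then create two tables. The table names `fileformat_demo_orc` and `fileformat_demo_seq` are hypothetical and exist only for illustration.

```sql
-- Switch the session default so a plain CREATE TABLE produces an ORC table.
set hive.default.fileformat=ORC;

-- Hypothetical table; with the setting above it should be stored as ORC.
CREATE TABLE fileformat_demo_orc (id bigint, name string);

-- An explicit STORED AS clause overrides the session default.
CREATE TABLE fileformat_demo_seq (id bigint, name string) STORED AS SEQUENCEFILE;

-- Inspect the InputFormat / OutputFormat rows to confirm the storage format.
DESCRIBE FORMATTED fileformat_demo_orc;
DESCRIBE FORMATTED fileformat_demo_seq;
```

The setting only affects tables created after it is changed; existing tables keep the format they were created with.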
Check the default compression settings for ORC and Parquet in Hive
set orc.compress;
set parquet.compression;
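ORC with Snappy compression is a commonly used combination, and CDH 6 already deploys Snappy automatically.
These are the sizes of a non-partitioned table before and after compression.
Enable compression when creating a table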
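One way to reproduce such a size comparison from within Hive is to refresh the table statistics and read the `totalSize` parameter. This is only a sketch: `payment_text` (uncompressed TextFile) and `payment_orc_snappy` (ORC with Snappy) are hypothetical, non-partitioned tables assumed to hold the same rows.

```sql
-- Refresh basic statistics so totalSize reflects the files currently on HDFS.
ANALYZE TABLE payment_text COMPUTE STATISTICS;
ANALYZE TABLE payment_orc_snappy COMPUTE STATISTICS;

-- totalSize is the on-disk size in bytes; compare the two values.
SHOW TBLPROPERTIES payment_text("totalSize");
SHOW TBLPROPERTIES payment_orc_snappy("totalSize");

-- The same figure appears under "Table Parameters" in the full description.
DESCRIBE FORMATTED payment_orc_snappy;
```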
CREATE TABLE `virtual_payment_cp` (
`ID` bigint,
`DEVICE_CODE` string COMMENT 'xx',
`LOGIN_ACCOUNT` string COMMENT 'xx',
`AMOUNT` decimal(11,2) COMMENT 'xx',
`PAY_RESULT` int COMMENT 'xx',
`CP_GAME_ID` bigint COMMENT 'xx'
) PARTITIONED BY(`DATE` STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS orc tblproperties ("orc.compress"="SNAPPY");
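The table property only takes effect for data written through Hive, so a typical next step is to load the table from an existing source. The sketch below assumes a hypothetical uncompressed staging table `virtual_payment_text` with the same six columns plus a `pay_date` string column; the date value is likewise only an example.

```sql
-- Write one partition; Hive encodes the rows as ORC and compresses them with Snappy.
INSERT OVERWRITE TABLE virtual_payment_cp PARTITION (`DATE`='2019-01-01')
SELECT id, device_code, login_account, amount, pay_result, cp_game_id
FROM virtual_payment_text          -- hypothetical staging table
WHERE pay_date = '2019-01-01';     -- hypothetical column and value

-- Confirm the compression property recorded on the table.
SHOW TBLPROPERTIES virtual_payment_cp("orc.compress");
```

For a table that already exists, `ALTER TABLE virtual_payment_cp SET TBLPROPERTIES ("orc.compress"="SNAPPY");` should change the property, but files written earlier keep their old compression until they are rewritten.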
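Enable compression for Hive tables
Temporary settings in a Hive session
# Compress intermediate data, i.e. the data produced by the map phase; this has the same effect as the map-side compression settings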
set hive.exec.compress.intermediate=true;
set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
# Compress the final output; equivalent to reduce-side output compression
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
# Map-side compression; equivalent to Hive intermediate compression
set mapreduce.map.output.compress=true;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
# Reduce-side output compression; equivalent to Hive output compression
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
# Compression type, applies to SequenceFile output
set mapred.map.output.compression.type=BLOCK;
set mapred.output.compression.type=BLOCK;
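To illustrate how the output settings combine for a SequenceFile table, here is a minimal sketch using the current MapReduce property names. The tables `payment_text` (source) and `payment_seq_snappy` (target) are hypothetical.

```sql
-- Compress the final job output with Snappy, block by block.
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

-- CTAS into a SequenceFile table; the output files pick up the settings above.
CREATE TABLE payment_seq_snappy
STORED AS SEQUENCEFILE
AS SELECT * FROM payment_text;   -- hypothetical source table
```

BLOCK compresses batches of records together, which generally gives a better ratio than RECORD while keeping the files splittable.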
CDH -> Hive -> hive-site.xml (client configuration)
<property><name>hive.exec.compress.intermediate</name><value>true</value></property>
<property><name>hive.intermediate.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value></property>
<property><name>hive.exec.compress.output</name><value>true</value></property>
<property><name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value></property>
Settings in YARN
Hadoop platform settings in mapred-site.xml
<property><name>mapreduce.map.output.compress</name><value>true</value></property>
<property><name>mapreduce.map.output.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value></property>
<property><name>mapreduce.output.fileoutputformat.compress</name>
<value>true</value></property>
<property><name>mapreduce.output.fileoutputformat.compress.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value></property>
Other compression options, for example:
Parquet + Snappy
set parquet.compression=SNAPPY;
# Or set it when creating the table
..
STORED AS parquet tblproperties("parquet.compression"="SNAPPY");