11.1 Determining Installed Codecs
# hive -e "set io.compression.codecs"
io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.SnappyCodec
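The io.compression.codecs property comes from Hadoop; additional codecs can be registered by extending the list, typically in $HADOOP_HOME/conf/core-site.xml. A minimal sketch, assuming every listed class is actually on the cluster's classpath:
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
<description>A comma-separated list of the compression codec classes
that can be used for compression/decompression.</description>
</property>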
11.2 Choosing a Compression Codec
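In brief, choosing a codec trades compression ratio against CPU cost: GZip and BZip2 compress harder but are slower, while Snappy compresses less aggressively but is much faster; BZip2 output is also splittable, which matters for very large files. A codec can be chosen per session from the Hive CLI; for example, favoring speed with Snappy:
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;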
11.3 Enabling Intermediate Compression
To enable intermediate compression, set hive.exec.compress.intermediate to true; the default is false. The corresponding Hadoop property that controls compression of intermediate map output is mapred.compress.map.output.
<property>
<name>hive.exec.compress.intermediate</name>
<value>true</value>
<description> This controls whether intermediate files produced by Hive between
multiple map-reduce jobs are compressed. The compression codec and other options
are determined from hadoop config variables mapred.output.compress* </description>
</property>
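The same property can also be set for just the current session from the Hive CLI, rather than in a configuration file:
hive> set hive.exec.compress.intermediate=true;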
Hadoop's default codec is DefaultCodec; to use a different one, change mapred.map.output.compression.codec. This can be configured in $HADOOP_HOME/conf/mapred-site.xml or $HIVE_HOME/conf/hive-site.xml:
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
<description> This controls whether intermediate files produced by Hive
between multiple map-reduce jobs are compressed. The compression codec
and other options are determined from hadoop config variables
mapred.output.compress* </description>
</property>
11.4 Final Output Compression
The property hive.exec.compress.output controls whether a query's final output is compressed. It defaults to false:
<property>
<name>hive.exec.compress.output</name>
<value>false</value>
<description> This controls whether the final outputs of a query
(to a local/hdfs file or a Hive table) is compressed. The compression
codec and other options are determined from hadoop config variables
mapred.output.compress* </description>
</property>
In Hadoop, this is governed by the property mapred.output.compress.
When hive.exec.compress.output is true, a codec must also be chosen:
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
<description>If the job outputs are compressed, how should they be compressed?
</description>
</property>
11.5 Sequence Files
To use sequence files from Hive, specify STORED AS SEQUENCEFILE in the CREATE TABLE statement (a column list is required unless the table is created with AS SELECT):
CREATE TABLE a_sequence_file_table (a INT, b INT) STORED AS SEQUENCEFILE;
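The table's storage format can be confirmed with DESCRIBE FORMATTED; the relevant lines of its output (abbreviated here) name the sequence file input and output formats:
hive> DESCRIBE FORMATTED a_sequence_file_table;
...
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
...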
SEQUENCEFILE offers three levels of compression: NONE, RECORD, and BLOCK. RECORD is the default, but BLOCK-level compression usually performs best. The level can be set in Hadoop's mapred-site.xml or Hive's hive-site.xml:
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
<description>If the job outputs are to compressed as SequenceFiles,
how should they be compressed? Should be one of NONE, RECORD or BLOCK.
</description>
</property>
11.6 Compression in Action
The source data:
hive> SELECT * FROM a;
4 5
3 2
hive> DESCRIBE a;
a int
b int
Enable intermediate compression:
hive> set hive.exec.compress.intermediate=true;
hive> CREATE TABLE intermediate_comp_on
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> AS SELECT * FROM a;
Moving data to: file:/user/hive/warehouse/intermediate_comp_on
Table default.intermediate_comp_on stats: [num_partitions: 0, num_files: 1,
num_rows: 2, total_size: 8, raw_data_size: 6]
As expected, intermediate compression does not change the final output, which is still uncompressed:
hive> dfs -ls /user/hive/warehouse/intermediate_comp_on;
Found 1 items
/user/hive/warehouse/intermediate_comp_on/000000_0
hive> dfs -cat /user/hive/warehouse/intermediate_comp_on/000000_0;
4 5
3 2
Next, pick a specific codec for intermediate compression instead of the default; here, gzip:
hive> set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> set hive.exec.compress.intermediate=true;
hive> CREATE TABLE intermediate_comp_on_gz
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> AS SELECT * FROM a;
Moving data to: file:/user/hive/warehouse/intermediate_comp_on_gz
Table default.intermediate_comp_on_gz stats:
[num_partitions: 0, num_files: 1, num_rows: 2, total_size: 8, raw_data_size: 6]
hive> dfs -cat /user/hive/warehouse/intermediate_comp_on_gz/000000_0;
4 5
3 2
Enable final output compression:
hive> set hive.exec.compress.output=true;
hive> CREATE TABLE final_comp_on
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> AS SELECT * FROM a;
Moving data to: file:/tmp/hive-edward/hive_2012-01-15_11-11-01_884_.../-ext-10001
Moving data to: file:/user/hive/warehouse/final_comp_on
Table default.final_comp_on stats:
[num_partitions: 0, num_files: 1, num_rows: 2, total_size: 16, raw_data_size: 6]
hive> dfs -ls /user/hive/warehouse/final_comp_on;
Found 1 items
/user/hive/warehouse/final_comp_on/000000_0.deflate
Note the .deflate suffix on the output file:
hive> dfs -cat /user/hive/warehouse/final_comp_on/000000_0.deflate;
... UGLYBINARYHERE ...
Querying the table still returns the rows, since Hive decompresses the file transparently:
hive> SELECT * FROM final_comp_on;
4 5
3 2
Change the codec used for final output compression:
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> CREATE TABLE final_comp_on_gz
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> AS SELECT * FROM a;
Moving data to: file:/user/hive/warehouse/final_comp_on_gz
Table default.final_comp_on_gz stats:
[num_partitions: 0, num_files: 1, num_rows: 2, total_size: 28, raw_data_size: 6]
hive> dfs -ls /user/hive/warehouse/final_comp_on_gz;
Found 1 items
/user/hive/warehouse/final_comp_on_gz/000000_0.gz
The output is now a .gz file, which the zcat command can display:
hive> ! /bin/zcat /user/hive/warehouse/final_comp_on_gz/000000_0.gz;
4 5
3 2
hive> SELECT * FROM final_comp_on_gz;
OK
4 5
3 2
Time taken: 0.159 seconds
Using the sequence file format:
hive> set mapred.output.compression.type=BLOCK;
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> CREATE TABLE final_comp_on_gz_seq
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS SEQUENCEFILE
> AS SELECT * FROM a;
Moving data to: file:/user/hive/warehouse/final_comp_on_gz_seq
Table default.final_comp_on_gz_seq stats:
[num_partitions: 0, num_files: 1, num_rows: 2, total_size: 199, raw_data_size: 6]
hive> dfs -ls /user/hive/warehouse/final_comp_on_gz_seq;
Found 1 items
/user/hive/warehouse/final_comp_on_gz_seq/000000_0
Sequence files are binary, but the header is readable and shows the key/value classes and the codec in use:
hive> dfs -cat /user/hive/warehouse/final_comp_on_gz_seq/000000_0;
SEQ[]org.apache.hadoop.io.BytesWritable[]org.apache.hadoop.io.BytesWritable[]
org.apache.hadoop.io.compress.GzipCodec[]
The dfs -text command strips the sequence file header and decompresses the records:
hive> dfs -text /user/hive/warehouse/final_comp_on_gz_seq/000000_0;
4 5
3 2
hive> SELECT * FROM final_comp_on_gz_seq;
OK
4 5
3 2
Finally, combine intermediate compression (Snappy) with compressed sequence file output (gzip):
hive> set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> set hive.exec.compress.intermediate=true;
hive> set mapred.output.compression.type=BLOCK;
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> CREATE TABLE final_comp_on_gz_int_compress_snappy_seq
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS SEQUENCEFILE AS SELECT * FROM a;
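As with the earlier tables, the result can be inspected with dfs commands; a sketch, assuming the same local warehouse path pattern seen above:
hive> dfs -ls /user/hive/warehouse/final_comp_on_gz_int_compress_snappy_seq;
hive> dfs -text /user/hive/warehouse/final_comp_on_gz_int_compress_snappy_seq/000000_0;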
11.7 Archiving Partitions
Hadoop has a storage format called HAR, the Hadoop Archive. A HAR file plays the same role for HDFS that a TAR file does for an ordinary filesystem. HAR files are slower to query and are not compressed, so they save no storage space; their value is packing many files into one, which reduces pressure on the NameNode.
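Outside of Hive, archives are built with Hadoop's archive tool, which can also be run directly; a minimal sketch, with illustrative source and destination paths:
$ hadoop archive -archiveName data.har -p /user/hive/warehouse/hive_text folder=docs /user/edward/archives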
hive> CREATE TABLE hive_text (line STRING) PARTITIONED BY (folder STRING);
hive> ! ls $HIVE_HOME;
LICENSE
README.txt
RELEASE_NOTES.txt
hive> ALTER TABLE hive_text ADD PARTITION (folder='docs');
hive> LOAD DATA INPATH '${env:HIVE_HOME}/README.txt'
> INTO TABLE hive_text PARTITION (folder='docs');
Loading data to table default.hive_text partition (folder=docs)
hive> LOAD DATA INPATH '${env:HIVE_HOME}/RELEASE_NOTES.txt'
> INTO TABLE hive_text PARTITION (folder='docs');
Loading data to table default.hive_text partition (folder=docs)
hive> SELECT * FROM hive_text WHERE line LIKE '%hive%' LIMIT 2;
http://hive.apache.org/ docs
- Hive 0.8.0 ignores the hive-default.xml file, though we continue docs
The ALTER TABLE ... ARCHIVE PARTITION statement converts the partition into an archived one (archiving must first be enabled):
hive> SET hive.archive.enabled=true;
hive> ALTER TABLE hive_text ARCHIVE PARTITION (folder='docs');
intermediate.archived is
file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ARCHIVED
intermediate.original is
file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ORIGINAL
Creating data.har for file:/user/hive/warehouse/hive_text/folder=docs
in file:/tmp/hive-edward/hive_..._3862901820512961909/-ext-10000/partlevel
Please wait... (this may take a while)
Moving file:/tmp/hive-edward/hive_..._3862901820512961909/-ext-10000/partlevel
to file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ARCHIVED
Moving file:/user/hive/warehouse/hive_text/folder=docs
to file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ORIGINAL
Moving file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ARCHIVED
to file:/user/hive/warehouse/hive_text/folder=docs
hive> dfs -ls /user/hive/warehouse/hive_text/folder=docs;
Found 1 items
/user/hive/warehouse/hive_text/folder=docs/data.har
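The individual files inside the archive stay accessible through Hadoop's har:// URI scheme; a hedged example, assuming the default filesystem shown above:
hive> dfs -ls har:///user/hive/warehouse/hive_text/folder=docs/data.har;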
ALTER TABLE ... UNARCHIVE PARTITION extracts the files from the HAR back into HDFS:
ALTER TABLE hive_text UNARCHIVE PARTITION (folder='docs');
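Listing the partition directory afterward should show the original files restored in place of data.har:
hive> dfs -ls /user/hive/warehouse/hive_text/folder=docs;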