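Snappy, LZO, bzip2, gzip, and deflate are all compression formats commonly used with Hive, and each has its own strengths. Here we only care about decompressing the actual files.
I. The code first: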
package compress;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;

public class Decompress {

    public static final Log LOG = LogFactory.getLog(Decompress.class.getName());

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register the codec classes the factory is allowed to choose from.
        String name = "io.compression.codecs";
        String value = "org.apache.hadoop.io.compress.GzipCodec,"
                + "org.apache.hadoop.io.compress.DefaultCodec,"
                + "com.hadoop.compression.lzo.LzoCodec,"
                + "com.hadoop.compression.lzo.LzopCodec,"
                + "org.apache.hadoop.io.compress.BZip2Codec";
        conf.set(name, value);
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Each command-line argument is the path of one compressed local file.
        for (int i = 0; i < args.length; ++i) {
            // The codec is chosen from the file name suffix.
            CompressionCodec codec = factory.getCodec(new Path(args[i]));
            if (codec == null) {
                System.out.println("Codec for " + args[i] + " not found.");
            } else {
                CompressionInputStream in = null;
                try {
                    // Wrap the raw file stream in the codec's decompressing stream.
                    in = codec.createInputStream(new java.io.FileInputStream(
                            args[i]));
                    byte[] buffer = new byte[100];
                    // Copy the decompressed bytes to stdout until end of stream.
                    int len = in.read(buffer);
                    while (len > 0) {
                        System.out.write(buffer, 0, len);
                        len = in.read(buffer);
                    }
                    System.out.flush();
                } finally {
                    if (in != null) {
                        in.close();
                    }
                }
            }
        }
    }
}
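The call to factory.getCodec(new Path(...)) picks a codec from the file name suffix, which is why a file with an unrecognized extension ends up in the "not found" branch. Below is a minimal plain-JDK sketch of that idea, not Hadoop's actual implementation. The class CodecSuffixSketch and its codecFor method are hypothetical names, and the suffix table lists what I understand to be each codec's default extension.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; in the real program the lookup is done
// by Hadoop's CompressionCodecFactory.
final class CodecSuffixSketch {

    // Default file extension -> codec class name (assumed defaults).
    private static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        CODECS.put(".lzo_deflate", "com.hadoop.compression.lzo.LzoCodec");
        CODECS.put(".lzo", "com.hadoop.compression.lzo.LzopCodec");
        CODECS.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODECS.put(".snappy", "org.apache.hadoop.io.compress.SnappyCodec");
    }

    // Returns the codec class name whose suffix matches, or null if none does.
    static String codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODECS.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        for (String arg : args) {
            String codec = codecFor(arg);
            System.out.println(arg + " -> " + (codec == null ? "no codec found" : codec));
        }
    }
}
```

In practice this means the compressed file should keep the extension it was written with; renaming it can make the lookup fail even though the content is intact.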
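II. Preparation
1. Prepare the dependencies
In brief, the core classes for these compression formats are:
org.apache.hadoop.io.compress.SnappyCodec
org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.BZip2Codec
org.apache.hadoop.io.compress.DeflateCodec
org.apache.hadoop.io.compress.DefaultCodec
First we need the jars that provide these classes. I put all the dependencies needed for decompression in the /home/apache/test/lib/ directory.
We also need the native library files that the compression codecs rely on. Find a machine with Hadoop installed and copy its $HADOOP_HOME/lib/native directory over; I put it under /tmp/decompress.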
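As far as I know, Hadoop loads its native hadoop library with System.loadLibrary, which searches the JVM's java.library.path, so the copied directory normally has to be passed with -Djava.library.path when running the class. Here is a small JDK-only sketch that checks whether a directory is on that path. NativePathCheck and onLibraryPath are hypothetical helper names, and the directory comes from the command line rather than being assumed.

```java
import java.io.File;
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper; it only inspects the path string and loads nothing.
final class NativePathCheck {

    // True if dir is exactly one of the entries of the given library path.
    static boolean onLibraryPath(String libraryPath, String dir) {
        String[] entries = libraryPath.split(Pattern.quote(File.pathSeparator));
        return Arrays.asList(entries).contains(dir);
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: NativePathCheck <native-lib-dir>");
            return;
        }
        String libraryPath = System.getProperty("java.library.path", "");
        System.out.println(args[0]
                + (onLibraryPath(libraryPath, args[0]) ? " is" : " is NOT")
                + " on java.library.path");
    }
}
```

The comparison is a plain string match, so the directory must be written the same way in both places (no trailing slash in one and not the other).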
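2. Prepare the compressed files
2.1. Snappy files
I don't have the Snappy library installed, so I use Hive to create the Snappy-compressed file.
This only takes two parameters:
hive.exec.compress.output: set to true to declare that the result files should be compressed
mapred.output.compression.codec: sets the specific compression format of the result files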
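In the hive shell these two parameters are set with plain SET statements. To keep the examples in one language, here is a sketch that builds the same two statements in Java and, optionally, runs them over JDBC. It assumes an already-open HiveServer2 JDBC connection, which the original setup does not mention. HiveSnappySettings is a hypothetical class name, and the codec value is the SnappyCodec class listed earlier.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that holds the two session settings named above.
final class HiveSnappySettings {

    // The two settings as SET statements, without the trailing semicolon.
    static List<String> statements() {
        return Arrays.asList(
                "SET hive.exec.compress.output=true",
                "SET mapred.output.compression.codec="
                        + "org.apache.hadoop.io.compress.SnappyCodec");
    }

    // Runs the settings on a Hive JDBC connection (assumed to exist already).
    static void apply(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            for (String sql : statements()) {
                st.execute(sql);
            }
        }
    }

    public static void main(String[] args) {
        // Print the statements in the form one would type into the hive shell.
        for (String sql : statements()) {
            System.out.println(sql + ";");
        }
    }
}
```

Running main simply prints the two lines with a trailing semicolon, ready to paste into a hive session.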
In the hive shell, check these two parameters and, after setting them to the Snappy format we need,