Compression algorithms and their codecs
Compression format | Codec class |
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
gzip | org.apache.hadoop.io.compress.GzipCodec |
bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
Snappy | org.apache.hadoop.io.compress.SnappyCodec |
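Compression:
The method takes a string naming the codec class, instantiates that codec through reflection, and compresses the file through the codec's output stream.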
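import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public static void compress(String method) throws ClassNotFoundException, IOException {
    File fileIn = new File("adult.data");
    // Load the codec class by name and instantiate it via reflection
    Class<?> codecClass = Class.forName(method);
    Configuration conf = new Configuration();
    CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
    // The output file takes the codec's default extension, e.g. ".gz"
    File fileOut = new File("adult.data" + codec.getDefaultExtension());
    fileOut.delete();
    // try-with-resources closes both streams even if the copy fails
    try (FileInputStream in = new FileInputStream(fileIn);
         CompressionOutputStream cout = codec.createOutputStream(new FileOutputStream(fileOut))) {
        // Copy the input through the compressing stream using a 4096-byte buffer
        IOUtils.copyBytes(in, cout, 4096, false);
    }
}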
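The reflection step can be tried without Hadoop on the classpath. The following is only a sketch of the same pattern, not the Hadoop API: it swaps in the JDK's `java.util.zip.GZIPOutputStream` for the Hadoop codec, and the class name `ReflectiveCompressSketch` and the helpers `openCompressor`, `compress`, and `gunzip` are hypothetical names introduced for illustration. It assumes the named class has a public constructor that takes an `OutputStream`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;

class ReflectiveCompressSketch {

    // Instantiate a compressing stream from its class name via reflection.
    // Assumption: the class has a public constructor taking an OutputStream.
    static OutputStream openCompressor(String className, OutputStream sink)
            throws ReflectiveOperationException {
        return Class.forName(className)
                .asSubclass(OutputStream.class)
                .getConstructor(OutputStream.class)
                .newInstance(sink);
    }

    // Compress data in memory through the reflectively created stream.
    static byte[] compress(String className, byte[] data)
            throws ReflectiveOperationException, IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream cout = openCompressor(className, sink)) {
            cout.write(data);
        } // closing the stream writes any remaining compressed bytes to sink
        return sink.toByteArray();
    }

    // Decompress gzip data so the round trip can be checked.
    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("hello hadoop\n");
        }
        byte[] original = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress("java.util.zip.GZIPOutputStream", original);
        System.out.println("compressed " + original.length + " -> " + packed.length + " bytes");
        System.out.println("round trip ok: " + Arrays.equals(original, gunzip(packed)));
    }
}
```

With Hadoop available, the equivalent call to the method above would pass a codec class name from the table, for example compress("org.apache.hadoop.io.compress.GzipCodec").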
Decompression:
When decompressing a file, the codec is usually inferred from the file's extension.
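import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;

public static void decompress(File file) throws IOException {
    Configuration conf = new Configuration();
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    // Infer the codec from the file extension
    CompressionCodec codec = factory.getCodec(new Path(file.getName()));
    if (codec == null) {
        // Unknown extension: stop here rather than hit a NullPointerException below
        System.out.println("Cannot find codec for file " + file);
        return;
    }
    // Wrap the file in the codec's decompressing input stream
    try (CompressionInputStream in = codec.createInputStream(new FileInputStream(file));
         FileOutputStream out = new FileOutputStream(new File("adult.data.decompress"))) {
        IOUtils.copyBytes(in, out, 4096, false);
    }
}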
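To make the extension-based inference concrete, here is a simplified model of the lookup, written with the JDK only. It is a sketch, not how CompressionCodecFactory is implemented internally: the class `ExtensionLookupSketch` and the method `codecClassFor` are hypothetical, and the suffixes are the default extensions of the four codecs listed in the table (".deflate", ".gz", ".bz2", ".snappy").

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ExtensionLookupSketch {

    // Suffix -> codec class name, keyed by each codec's default extension.
    static final Map<String, String> CODEC_BY_SUFFIX = new LinkedHashMap<>();
    static {
        CODEC_BY_SUFFIX.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        CODEC_BY_SUFFIX.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODEC_BY_SUFFIX.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODEC_BY_SUFFIX.put(".snappy", "org.apache.hadoop.io.compress.SnappyCodec");
    }

    // Return the codec class name whose suffix matches the file name,
    // or null when no known suffix matches.
    static String codecClassFor(String fileName) {
        for (Map.Entry<String, String> entry : CODEC_BY_SUFFIX.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(codecClassFor("adult.data.gz"));  // org.apache.hadoop.io.compress.GzipCodec
        System.out.println(codecClassFor("adult.data.bz2")); // org.apache.hadoop.io.compress.BZip2Codec
        System.out.println(codecClassFor("adult.data"));     // null
    }
}
```

The last call shows why the null check in decompress matters: a file with no recognized extension yields no codec, so the method has to stop before it tries to create an input stream.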