Hadoop I/O Operations
1. Compression
File compression has two major benefits:
- it reduces the disk space needed to store files
- it speeds up data transfer across the network and to or from disk
Both benefits matter even more in Hadoop, which routinely stores and moves very large datasets.
Codec
A codec is an implementation of a compression-decompression algorithm. In Hadoop, a codec is represented by an implementation of the CompressionCodec interface. Common implementations are DefaultCodec (implements CompressionCodec), GzipCodec (extends DefaultCodec), BZip2Codec, LzopCodec, Lz4Codec, and SnappyCodec, each corresponding to a particular compression-decompression algorithm.
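To make this concrete, here is a small sketch (the class name CodecExtensions is my own, not from the original) that instantiates a few of these codecs reflectively, as Hadoop itself does, and prints the filename extension each one claims:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;
public class CodecExtensions {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        Class<?>[] codecClasses = {DefaultCodec.class, GzipCodec.class, BZip2Codec.class};
        for (Class<?> codecClass : codecClasses) {
            // ReflectionUtils.newInstance also injects the Configuration if the codec needs it.
            CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
            // Prints .deflate, .gz and .bz2 respectively.
            System.out.println(codecClass.getSimpleName() + " -> " + codec.getDefaultExtension());
        }
    }
}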
1) To compress data written to an output stream, CompressionCodec provides the method
CompressionOutputStream createOutputStream(OutputStream out) throws IOException;
Demo:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;
public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        String codecClassName = args[0];
        Configuration configuration = new Configuration();
        Class<?> codecClass = Class.forName(codecClassName);
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, configuration);
        // Wrap stdout in a compressing stream and copy stdin through it.
        CompressionOutputStream outputStream = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, outputStream, 4096, false);
        // finish() (not flush()) tells the compressor to complete the compressed
        // stream without closing it, so the output is well formed.
        outputStream.finish();
    }
}
To compress "Text" with GzipCodec and decompress it again with gunzip, run the following on the command line:
echo "Text" | hadoop StreamCompressor org.apache.hadoop.io.compress.GzipCodec | gunzip -
2) To decompress data read from an input stream, CompressionCodec provides the method
CompressionInputStream createInputStream(InputStream in) throws IOException;
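For symmetry with StreamCompressor, here is a minimal sketch of the read side (the class name StreamDecompressor is my own illustration, not from the original) that decompresses stdin and writes the decoded bytes to stdout:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.util.ReflectionUtils;
public class StreamDecompressor {
    public static void main(String[] args) throws Exception {
        String codecClassName = args[0];
        Configuration configuration = new Configuration();
        Class<?> codecClass = Class.forName(codecClassName);
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, configuration);
        // Wrap stdin in a decompressing stream and copy the decoded bytes to stdout.
        CompressionInputStream inputStream = codec.createInputStream(System.in);
        IOUtils.copyBytes(inputStream, System.out, 4096, false);
        inputStream.close();
    }
}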
CompressionOutputStream and CompressionInputStream can reset their underlying compressor or decompressor, which makes it possible to compress sections of a data stream as separate blocks; the SequenceFile format described below takes advantage of this.
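A rough sketch of that block pattern (my own example, using DefaultCodec for concreteness; SequenceFile does something similar internally): finish() completes the current compressed block without closing the stream, and resetState() resets the compressor so the next write starts an independent block.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.util.ReflectionUtils;
import java.io.ByteArrayOutputStream;
public class BlockCompressSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        CompressionCodec codec = ReflectionUtils.newInstance(DefaultCodec.class, conf);
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        CompressionOutputStream out = codec.createOutputStream(raw);
        out.write("first block".getBytes("UTF-8"));
        out.finish();      // complete the current compressed block
        out.resetState();  // reset the compressor; the next write begins a new, independent block
        out.write("second block".getBytes("UTF-8"));
        out.finish();
        out.close();
        System.out.println("compressed bytes written: " + raw.size());
    }
}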
Inferring a CompressionCodec with CompressionCodecFactory
Demo:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import java.io.InputStream;
import java.io.OutputStream;
public class FileDecompressor {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Path inputPath = new Path(uri);
        Configuration configuration = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(configuration);
        // getCodec() infers the codec from the file's extension
        // (it returns null if no registered codec matches).
        CompressionCodec compressionCodec = factory.getCodec(inputPath);
        // Strip the codec's extension with the factory's static removeSuffix() to form
        // the output filename, e.g. file.gz decompresses to file.
        String outputUri = CompressionCodecFactory.removeSuffix(uri, compressionCodec.getDefaultExtension());
        InputStream in = null;
        OutputStream out = null;
        FileSystem fs = FileSystem.get(configuration);
        try {
            in = compressionCodec.createInputStream(fs.open(inputPath));
            out = fs.create(new Path(outputUri));
            IOUtils.copyBytes(in, out, configuration);
        } finally {
            IOUtils.closeStream(out);
            IOUtils.closeStream(in);
        }
    }
}
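Assuming FileDecompressor is on the Hadoop classpath, running it against a gzipped file would look something like this, leaving the decompressed copy in file:
hadoop FileDecompressor file.gz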
CodecPool
When native libraries are in use (they can considerably improve compression and decompression performance) and the application performs many compression or decompression operations, CodecPool is worth using: it lets you reuse Compressor and Decompressor instances across streams instead of creating a new one each time, amortizing the cost of creating these objects.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.util.ReflectionUtils;
public class PooledStreamCompressor {
    public static void main(String[] args) throws Exception {
        String codecClassName = args[0];
        Class<?> codecClass = Class.forName(codecClassName);
        Configuration configuration = new Configuration();
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, configuration);
        Compressor compressor = null;
        try {
            // Borrow a Compressor for this codec from the pool.
            compressor = CodecPool.getCompressor(codec);
            CompressionOutputStream outputStream = codec.createOutputStream(System.out, compressor);
            IOUtils.copyBytes(System.in, outputStream, 4096, false);
            outputStream.finish();
        } finally {
            // Return the Compressor to the pool so it can be reused.
            CodecPool.returnCompressor(compressor);
        }
    }
}
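The read side is symmetric: CodecPool.getDecompressor() borrows a Decompressor for a given codec and CodecPool.returnDecompressor() hands it back, following the same borrow-and-return pattern as above.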