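Data Compression in Hive
- The official Apache Hadoop binary packages do not include native compression support (for example, Snappy), so Hadoop has to be built from source
How to build Hadoop from source: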
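- 1. Install the Snappy compression library
- 2. Build the Hadoop 2.x source
- 3. mvn package -Pdist,native -DskipTests -Dtar -Drequire.snappy
- 4. After the build finishes, copy the files under hadoop-2.x/target/hadoop-2.x/lib/native into the corresponding lib/native directory under the Hadoop installation directory
- 5. Run the hadoop checknative command and check that snappy is reported as true
- 6. Set the compression codec in Hive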
- set mapreduce.map.output.compress=true;
- set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
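Step 5 above can also be checked from a script. The following is a minimal sketch, not an official tool: it assumes that `hadoop checknative` prints one line per library in the form `name: true|false ...`, and the helper names `parse_checknative` and `snappy_available` are hypothetical.

```python
import subprocess


def parse_checknative(output: str) -> dict:
    """Map each library name to True/False from `hadoop checknative` text."""
    status = {}
    for line in output.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only lines of the assumed form "<name>: true|false ...".
        if sep and fields and fields[0] in ("true", "false"):
            status[name.strip()] = fields[0] == "true"
    return status


def snappy_available(hadoop_cmd: str = "hadoop") -> bool:
    """Run `hadoop checknative` and report whether snappy shows as true."""
    result = subprocess.run([hadoop_cmd, "checknative"],
                            capture_output=True, text=True)
    # The report may land on stdout or stderr depending on logging setup,
    # so both streams are scanned.
    report = result.stdout + "\n" + result.stderr
    return parse_checknative(report).get("snappy", False)
```

Calling `snappy_available()` on a node where the rebuilt native libraries were copied in should return True; if it returns False, the files from step 4 are probably not in the directory Hadoop loads native libraries from.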
Common compression codecs:
- Zlib: org.apache.hadoop.io.compress.DefaultCodec
- Gzip: org.apache.hadoop.io.compress.GzipCodec
- Bzip2: org.apache.hadoop.io.compress.BZip2Codec
- Lzo: com.hadoop.compression.lzo.LzoCodec
- Lz4: org.apache.hadoop.io.compress.Lz4Codec
- Snappy: org.apache.hadoop.io.compress.SnappyCodec
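Any codec class from the list above can be plugged into the two map-output settings from step 6. The sketch below only generates the `set` statements as text; the `CODECS` table and the `map_output_settings` helper are hypothetical names introduced for illustration, and whether a given codec actually works depends on the native libraries installed on the cluster.

```python
from typing import List

# Short name -> Hadoop codec class, taken from the list above.
# Note: Hadoop spells the bzip2 class with a capital Z (BZip2Codec).
CODECS = {
    "zlib": "org.apache.hadoop.io.compress.DefaultCodec",
    "gzip": "org.apache.hadoop.io.compress.GzipCodec",
    "bzip2": "org.apache.hadoop.io.compress.BZip2Codec",
    "lzo": "com.hadoop.compression.lzo.LzoCodec",
    "lz4": "org.apache.hadoop.io.compress.Lz4Codec",
    "snappy": "org.apache.hadoop.io.compress.SnappyCodec",
}


def map_output_settings(codec_name: str) -> List[str]:
    """Return the Hive `set` statements that enable map-output compression."""
    codec_class = CODECS[codec_name.lower()]  # KeyError for unknown names
    return [
        "set mapreduce.map.output.compress=true;",
        f"set mapreduce.map.output.compress.codec={codec_class};",
    ]


if __name__ == "__main__":
    # Print statements that can be pasted into a Hive session.
    print("\n".join(map_output_settings("snappy")))
```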
Benefits of compression:
- Hadoop jobs are usually I/O bound, so compression reduces the I/O a job has to perform
- Compression reduces the size of the data transferred across the network
- Overall job performance may be increased by simply enabling compression
- Splittability must be taken into account: whether a compressed file can be split into input splits affects how many map tasks can process it in parallel
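The size reduction described above is easy to observe locally. This is a rough sketch that uses the zlib, gzip, and bzip2 implementations in the Python standard library as stand-ins for the Hadoop codecs of the same name; Snappy, LZO, and LZ4 are left out because they need third-party packages. The sample data is made up, and real ratios depend heavily on the data.

```python
import bz2
import gzip
import zlib


def compressed_sizes(data: bytes) -> dict:
    """Return the byte size of `data` raw and under three stdlib compressors."""
    return {
        "raw": len(data),
        "zlib": len(zlib.compress(data)),
        "gzip": len(gzip.compress(data)),
        "bzip2": len(bz2.compress(data)),
    }


# Hypothetical, highly repetitive log-like input (compresses very well).
sample = b"2024-01-01 INFO user=42 action=click page=/home\n" * 10_000
sizes = compressed_sizes(sample)

if __name__ == "__main__":
    for name, size in sizes.items():
        ratio = size / sizes["raw"]
        print(f"{name:>5}: {size:>8} bytes ({ratio:.1%} of raw)")
```

Size is only part of the choice. A plain gzip file cannot be split, so a single map task has to read the whole file, while a bzip2 file can be split across several map tasks at the cost of slower compression.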