I. Purpose of Compression
Compression reduces the amount of disk space data occupies on the cluster and cuts down the network I/O incurred when data is transferred during parallel computation.
II. Compression Types
SnappyCodec, GzipCodec, BZip2Codec, Lz4Codec, LzoCodec
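Each entry is the name of a CompressionCodec implementation class. As a quick orientation, the sketch below is my own illustration (not part of the original test code; the class name ListCodecs is made up, and the com.hadoop.compression.lzo class only resolves if the hadoop-lzo jar is installed): it instantiates each codec and prints the file extension it appends to output files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class ListCodecs {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        String[] codecClasses = {
                "org.apache.hadoop.io.compress.GzipCodec",
                "org.apache.hadoop.io.compress.BZip2Codec",
                "org.apache.hadoop.io.compress.SnappyCodec",
                "org.apache.hadoop.io.compress.Lz4Codec",
                "com.hadoop.compression.lzo.LzoCodec"   // provided by hadoop-lzo, if installed
        };
        for (String name : codecClasses) {
            try {
                CompressionCodec codec =
                        (CompressionCodec) ReflectionUtils.newInstance(Class.forName(name), conf);
                System.out.println(name + " -> output extension " + codec.getDefaultExtension());
            } catch (Exception e) {
                System.out.println(name + " -> not available (" + e + ")");
            }
        }
    }
}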
III. Dependencies
SnappyCodec and LzoCodec require native library support.
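Whether the native libraries are actually being picked up can be checked before running any job. Below is a minimal sketch of my own using Hadoop's public NativeCodeLoader and SnappyCodec helpers (the hadoop checknative -a command reports similar information from the shell).

import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.util.NativeCodeLoader;

public class NativeCheck {
    public static void main(String[] args) {
        // true only if libhadoop.so was found on java.library.path
        System.out.println("hadoop native loaded: " + NativeCodeLoader.isNativeCodeLoaded());
        // true only if libhadoop.so was built with Snappy support
        System.out.println("snappy supported:     " + SnappyCodec.isNativeCodeLoaded());
    }
}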
IV. Building the Native Libraries
1. Building the LZO native library
1.1 Install lzo-2.06.tar.gz
1.2 Steps: extract the archive; cd into the lzo-2.06 directory; run ./configure; then make && make install
1.3 If the build fails for lack of a C++ compiler: yum install gcc-c++
1.4 Download hadoop-lzo-master.zip
1.5 Then, using the Hadoop build environment, enter the extracted directory and run mvn package to build it
1.6 The target directory will then contain
(1) the .so files, i.e. the native library; copy them to /opt/hadoop-2.5.1/lib/native
(2) the jar; copy it to /opt/hadoop-2.5.1/share/hadoop/common
2. Building the Snappy native library
Snappy needs special attention: older versions were built with roughly the same steps as LZO, but newer versions are compiled as part of the Hadoop build itself, using the command
mvn package -Pdist,native -DskipTests -Drequire.snappy
In other words, the Hadoop native library itself is built with Snappy support.
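With the native library and jar in place, a quick standalone round trip through a codec confirms that compression and decompression actually work before running a full MapReduce job. The following is a rough sketch of my own (the codec class name is taken from the command line, e.g. org.apache.hadoop.io.compress.SnappyCodec):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

public class CodecRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        CompressionCodec codec = (CompressionCodec) ReflectionUtils
                .newInstance(Class.forName(args[0]), conf);

        byte[] original = "hello compression".getBytes("UTF-8");

        // Compress into an in-memory buffer; fails here if the native library is missing.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        OutputStream out = codec.createOutputStream(compressed);
        out.write(original);
        out.close();

        // Decompress and compare with the original bytes.
        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        InputStream in = codec.createInputStream(new ByteArrayInputStream(compressed.toByteArray()));
        IOUtils.copyBytes(in, restored, 4096, true);

        System.out.println("round trip ok: " + Arrays.equals(original, restored.toByteArray()));
    }
}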
V. Compression Test
1. To verify that the installation succeeded, use WordCount as the test job.
1.1 Code (the main method)
// Imports needed by the main method (TokenizerMapper and IntSumReducer are the standard WordCount classes).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public static void main(String[] args) throws Exception {
    // Available codec class names; the one to use is passed as the last command-line argument.
    // String codecClassname = "org.apache.hadoop.io.compress.SnappyCodec";
    // String codecClassname = "org.apache.hadoop.io.compress.GzipCodec";
    // String codecClassname = "org.apache.hadoop.io.compress.BZip2Codec";
    // String codecClassname = "org.apache.hadoop.io.compress.Lz4Codec";
    // String codecClassname = "org.apache.hadoop.io.compress.LzoCodec";
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://192.168.2.102:8020/");
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length < 3) {
        System.err.println("Usage: wordcount <in> [<in>...] <out> <codecClass>");
        System.exit(2);
    }
    // Compress the intermediate map output; set this before the Job copies the conf.
    conf.setBoolean("mapreduce.map.output.compress", true);
    // conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);
    Job job = Job.getInstance(conf, "vrv job");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // All arguments except the last two are input paths.
    for (int i = 0; i < otherArgs.length - 2; ++i) {
        FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
    }
    // The second-to-last argument is the output path; the last one names the codec class.
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 2]));
    // FileInputFormat.addInputPath(job, new Path("/qiaoting/input2"));
    // FileOutputFormat.setOutputPath(job, new Path("/qiaoting/output"));
    FileOutputFormat.setCompressOutput(job, true);
    Class<? extends CompressionCodec> compressCla =
            Class.forName(otherArgs[otherArgs.length - 1]).asSubclass(CompressionCodec.class);
    FileOutputFormat.setOutputCompressorClass(job, compressCla);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
1.2 Run the command hadoop jar qt.jar WordCount /qiaoting/input2 /qiaoting/output org.apache.hadoop.io.compress.LzoCodec
The result screenshot is shown below.
1.3 Run the command hadoop jar qt.jar WordCount /qiaoting/input2 /qiaoting/output org.apache.hadoop.io.compress.SnappyCodec
The result screenshot is shown below.
1.4 Run the command hadoop jar qt.jar WordCount /qiaoting/input2 /qiaoting/output org.apache.hadoop.io.compress.BZip2Codec
The result screenshot is shown below.
1.5 Run the command hadoop jar qt.jar WordCount /qiaoting/input2 /qiaoting/output org.apache.hadoop.io.compress.GzipCodec
The result screenshot is shown below.
1.6 Run the command hadoop jar qt.jar WordCount /qiaoting/input2 /qiaoting/output org.apache.hadoop.io.compress.Lz4Codec
The result screenshot is shown below.
2. As for compression ratios, a quick look at the compression result figure is enough: the original file, compress, is about 8 GB, and the codecs with higher compression ratios also take noticeably longer to compress.
VI. Configuring Compression in Hadoop
1. In hadoop-env.sh, add LD_LIBRARY_PATH=/opt/hadoop-2.5.1/lib/native
2. In core-site.xml:
<property>
  <name>hadoop.native.lib</name>
  <value>true</value>
</property>
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,
         org.apache.hadoop.io.compress.DefaultCodec,
         com.hadoop.compression.lzo.LzoCodec,
         com.hadoop.compression.lzo.LzopCodec,
         org.apache.hadoop.io.compress.BZip2Codec
  </value>
</property>
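The io.compression.codecs list is what CompressionCodecFactory consults when mapping a file extension back to a codec, for example when reading compressed job output. Below is a small sketch of my own illustrating that lookup (the class name ReadCompressed and the path argument are just examples):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

import java.io.InputStream;

public class ReadCompressed {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);   // e.g. /qiaoting/output/part-r-00000.bz2
        FileSystem fs = FileSystem.get(conf);

        // Pick the codec registered for this file extension (null means "not compressed").
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
        InputStream in = (codec == null)
                ? fs.open(path)
                : codec.createInputStream(fs.open(path));

        // Dump the decompressed content to stdout.
        IOUtils.copyBytes(in, System.out, 4096, true);
    }
}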
3. In mapred-site.xml:
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
  <description>Enable compression of the job output</description>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>RECORD</value>
  <description>Compression granularity for SequenceFile output; record-level compression is the default</description>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  <description>Which codec to use; each compression format has its own codec class</description>
</property>
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
  <description>Enable compression of the intermediate map output</description>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  <description>Which codec to use for the map output</description>
</property>
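The same behaviour can also be requested per job through the Java API instead of cluster-wide XML. Below is a sketch of my own showing the equivalent calls with the standard MapReduce job-setup methods (the class name CompressionConfigExample is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressionConfigExample {
    public static Job configure(Configuration conf) throws Exception {
        // Map-output compression goes into the Configuration before the Job copies it.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec", DefaultCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed job");

        // Equivalent of mapreduce.output.fileoutputformat.compress and .compress.codec
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);

        // Equivalent of mapreduce.output.fileoutputformat.compress.type
        // (only relevant when the output format writes SequenceFiles)
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.RECORD);

        return job;
    }
}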