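Hadoop Data Compression
1. Overview
   1. Pros and cons of compression:
      - Pros: less disk I/O and less disk storage space
      - Cons: more CPU overhead
   2. When to compress:
      - Compute-intensive (CPU-bound) jobs: use compression sparingly
      - I/O-intensive jobs: use compression more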
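The trade-off above can be observed with nothing but the JDK. The sketch below is not Hadoop code: it uses `java.util.zip.GZIPOutputStream` (a DEFLATE-based format, like Hadoop's gzip codec) on a made-up, highly repetitive input and prints how many bytes are saved and how long the compression took. The class name, method names, and sample text are illustrative assumptions, and a single timing run is only a rough indication.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

class CompressionTradeoffDemo {

    // Compresses the input with gzip and returns the compressed bytes.
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Builds a repetitive sample, similar in spirit to word-count input (hypothetical data).
    static byte[] sampleInput(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("hadoop mapreduce compression hadoop\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleInput(100_000);
        long start = System.nanoTime();
        byte[] packed = gzip(raw);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        // Fewer bytes to store and move, paid for with time spent compressing.
        System.out.println("raw bytes:         " + raw.length);
        System.out.println("compressed bytes:  " + packed.length);
        System.out.println("elapsed time (us): " + elapsedMicros);
    }
}
```

On an I/O-bound job the saved bytes usually outweigh the extra time; on a CPU-bound job the extra time competes with the computation itself.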
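2. Compression codecs supported by MapReduce
- Comparison of compression formats
Format | Bundled with Hadoop | Algorithm | File extension | Splittable | Program changes needed after switching to this format |
---|---|---|---|---|---|
DEFLATE | Yes | DEFLATE | .deflate | No | None; handled the same as plain text |
Gzip | Yes | DEFLATE | .gz | No | None; handled the same as plain text |
bzip2 | Yes | bzip2 | .bz2 | Yes | None; handled the same as plain text |
LZO | No, must be installed separately | LZO | .lzo | Yes | An index must be built and the input format must be specified |
Snappy | Yes | Snappy | .snappy | No | None; handled the same as plain text |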
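The DEFLATE and Gzip rows list the same algorithm; what differs is the container around the compressed data. The JDK-only sketch below illustrates this with `DeflaterOutputStream` (zlib container) and `GZIPOutputStream` (gzip container). It is not Hadoop's codec code, and the class and method names are illustrative assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

class DeflateContainerDemo {

    // DEFLATE data wrapped in the zlib container (the JDK default for DeflaterOutputStream).
    static byte[] zlib(byte[] input) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (OutputStream out = new DeflaterOutputStream(buffer)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // DEFLATE data wrapped in the gzip container (larger header plus a trailer).
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(buffer)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = "hadoop hadoop hadoop hadoop".getBytes(StandardCharsets.UTF_8);
        byte[] z = zlib(data);
        byte[] g = gzip(data);
        // Print the leading bytes of each container and the total sizes.
        System.out.printf("zlib: first byte 0x%02x, %d bytes%n", z[0] & 0xff, z.length);
        System.out.printf("gzip: first bytes 0x%02x 0x%02x, %d bytes%n", g[0] & 0xff, g[1] & 0xff, g.length);
    }
}
```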
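- Compression performance comparison
Algorithm | Original size | Compressed size | Compression speed | Decompression speed |
---|---|---|---|---|
gzip | 8.3 GB | 1.8 GB | 17.5 MB/s | 58 MB/s |
bzip2 | 8.3 GB | 1.1 GB | 2.4 MB/s | 9.5 MB/s |
LZO | 8.3 GB | 2.9 GB | 49.3 MB/s | 74.6 MB/s |

Snappy compresses at roughly 250 MB/s and decompresses at roughly 500 MB/s.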
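To make the table easier to compare, the figures can be turned into a size ratio and an approximate compression time. The sketch below does that arithmetic under two assumptions that the table does not state: 1 GB is treated as 1024 MB, and the compression speed is taken to be measured against the uncompressed input. The class and method names are illustrative.

```java
class CompressionTableMath {

    // Fraction of the original size that remains after compression.
    static double ratio(double originalGb, double compressedGb) {
        return compressedGb / originalGb;
    }

    // Seconds needed to compress the original data at the given throughput.
    // Assumes 1 GB = 1024 MB and that the speed refers to uncompressed input.
    static double compressSeconds(double originalGb, double mbPerSecond) {
        return originalGb * 1024.0 / mbPerSecond;
    }

    public static void main(String[] args) {
        String[] names = {"gzip", "bzip2", "LZO"};
        double[] compressedGb = {1.8, 1.1, 2.9};
        double[] compressMbPerSec = {17.5, 2.4, 49.3};
        double originalGb = 8.3;
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-6s keeps %.1f%% of the original size, about %.0f s to compress%n",
                    names[i],
                    100 * ratio(originalGb, compressedGb[i]),
                    compressSeconds(originalGb, compressMbPerSec[i]));
        }
    }
}
```

Read this way, bzip2 produces the smallest files but is by far the slowest, while LZO accepts a larger file in exchange for the fastest compression of the three.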
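3. Choosing where to compress
4. Compression parameter configuration
- To support multiple compression/decompression algorithms, Hadoop provides codecs (coder/decoder classes)
Format | Codec class |
---|---|
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec |
gzip | org.apache.hadoop.io.compress.GzipCodec |
bzip2 | org.apache.hadoop.io.compress.BZip2Codec |
LZO | com.hadoop.compression.lzo.LzopCodec (from the separately installed hadoop-lzo library) |
Snappy | org.apache.hadoop.io.compress.SnappyCodec |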
- To enable compression in Hadoop, configure the following parameters
Parameter | Default | Stage | Recommendation |
---|---|---|---|
io.compression.codecs (in core-site.xml) | none | Input compression | Hadoop uses the file extension to determine which codec applies |
mapreduce.map.output.compress (in mapred-site.xml) | false | Mapper output | Set to true to enable compression |
mapreduce.map.output.compress.codec (in mapred-site.xml) | org.apache.hadoop.io.compress.DefaultCodec | Mapper output | LZO or Snappy is commonly used in production to compress data at this stage |
mapreduce.output.fileoutputformat.compress (in mapred-site.xml) | false | Reducer output | Set to true to enable compression |
mapreduce.output.fileoutputformat.compress.codec (in mapred-site.xml) | org.apache.hadoop.io.compress.DefaultCodec | Reducer output | Use standard tools or codecs such as gzip and bzip2 |
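The first row says that the codec for an input file is chosen from its extension; in Hadoop this lookup is done by `CompressionCodecFactory`. The sketch below is a simplified stand-alone model of that idea, not Hadoop's implementation. The mapping mirrors the codec table in this section, with the LZO class name taken from the hadoop-lzo package, and `CodecByExtension` and `codecFor` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

class CodecByExtension {

    // Extension-to-codec mapping (a model of the lookup, not Hadoop's own table).
    private static final Map<String, String> CODECS = new LinkedHashMap<>();

    static {
        CODECS.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        CODECS.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODECS.put(".lzo", "com.hadoop.compression.lzo.LzopCodec");
        CODECS.put(".snappy", "org.apache.hadoop.io.compress.SnappyCodec");
    }

    // Returns the codec class name whose extension matches the file name, if any.
    static Optional<String> codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODECS.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        return Optional.empty(); // no known extension: treat the file as uncompressed
    }

    public static void main(String[] args) {
        for (String name : new String[] {"words.txt", "words.gz", "words.bz2"}) {
            System.out.println(name + " -> " + codecFor(name).orElse("(no codec, read as plain text)"));
        }
    }
}
```

This is why a compressed input file needs no change in the job code as long as its extension matches a registered codec.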
5. Compression in practice
- Write the Mapper class
    package com.codecat.mapreduce.compress;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    import java.io.IOException;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        // Reused output objects, so no new object is allocated per record
        private Text outK = new Text();
        private IntWritable outV = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Read one line of input
            String line = value.toString();
            // Split the line into words on single spaces
            String[] words = line.split(" ");
            // Emit (word, 1) for every word
            for (String word : words) {
                outK.set(word);
                context.write(outK, outV);
            }
        }
    }
- Write the Reducer class
    package com.codecat.mapreduce.compress;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    import java.io.IOException;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        // Reused output value
        private IntWritable outV = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Sum the counts emitted for this word
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            outV.set(sum);
            // Emit (word, total)
            context.write(key, outV);
        }
    }
- Write the Driver class
    package com.codecat.mapreduce.compress;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.BZip2Codec;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    import java.io.IOException;

    public class WordCountDriver {
        public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
            Configuration conf = new Configuration();
            // Enable compression of the map output
            conf.setBoolean("mapreduce.map.output.compress", true);
            // Compress the map output with bzip2
            conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);

            Job job = Job.getInstance(conf);
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.setInputPaths(job, new Path("D:\\CODE-STUDY\\JAVA\\Hadoop_Data\\input\\inputword"));
            FileOutputFormat.setOutputPath(job, new Path("D:\\CODE-STUDY\\JAVA\\Hadoop_Data\\output\\outputGzip"));

            // Enable compression of the reducer output
            FileOutputFormat.setCompressOutput(job, true);
            // Compress the reducer output with gzip
            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

            boolean result = job.waitForCompletion(true);
            System.exit(result ? 0 : 1);
        }
    }
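Because the driver writes the reducer output through `GzipCodec`, the part files in the output directory are gzip files, which the JDK can read directly. The sketch below parses such a file into a map, assuming the default text output layout of one `word<TAB>count` pair per line. The class name `GzipOutputReader` is hypothetical, and the path passed on the command line (for example a `part-r-00000.gz` file under the output directory) is an assumption about where the job wrote its result.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;

class GzipOutputReader {

    // Reads "word<TAB>count" lines from a gzip-compressed stream into an ordered map.
    static Map<String, Integer> readCounts(InputStream gzipped) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.lastIndexOf('\t');
                if (tab < 0) {
                    continue; // skip lines that are not key/value pairs
                }
                counts.put(line.substring(0, tab), Integer.parseInt(line.substring(tab + 1).trim()));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a gzip part file produced by the job (hypothetical location)
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            readCounts(in).forEach((word, count) -> System.out.println(word + " -> " + count));
        }
    }
}
```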
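- Result: the reducer output in the output directory is gzip-compressed (for example `part-r-00000.gz`); the compressed map output is intermediate data and does not appear there.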
Reference: https://www.bilibili.com/video/BV1Qp4y1n7EN?