5. Hadoop Data Compression

卌卄

Last modified: 2023-08-11 18:58:48

Category: MapReduce. Tags: hadoop, big data, distributed

First published: 2023-08-11 18:57:30

Article link: https://blog.csdn.net/weixin_44340038/article/details/132231605

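1. Overview

1) Benefits and costs of compression

Benefits: less disk I/O and less disk space used.

Cost: extra CPU overhead for compressing and decompressing.

2) Guidelines for using compression

(1) For compute-intensive jobs, use compression sparingly.

(2) For I/O-intensive jobs, use compression more heavily.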
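
The trade-off above can be observed outside Hadoop. The following is a minimal sketch that uses only the JDK's gzip stream (not Hadoop's codec classes); the class name `CompressionTradeoff` and the sample data are made up for illustration. It compresses a repetitive text buffer and reports how many bytes were saved and how long the compression took.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressionTradeoff {

    // Compress a byte array in memory with the JDK's gzip (DEFLATE) stream.
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Build repetitive sample text, loosely resembling word-count input (hypothetical data).
    static byte[] sampleData(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("hadoop mapreduce compression demo line ").append(i % 10).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleData(100_000);
        long start = System.nanoTime();
        byte[] packed = gzip(raw);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        // Fewer bytes means less disk and network I/O; the elapsed time is the CPU-side cost.
        System.out.println("raw bytes:          " + raw.length);
        System.out.println("compressed bytes:   " + packed.length);
        System.out.println("elapsed time (us):  " + elapsedMicros);
    }
}
```

The exact numbers depend on the machine and on how repetitive the data is, so treat the output as an illustration of the trade-off rather than a benchmark.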

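2. Compression codecs supported by MR

1) Comparison of compression algorithms

Format	Bundled with Hadoop?	Algorithm	File extension	Splittable?	Does the program need changes after switching to this format?
DEFLATE	Yes, usable directly	DEFLATE	.deflate	No	No changes; handled like plain text
Gzip	Yes, usable directly	DEFLATE	.gz	No	No changes; handled like plain text
bzip2	Yes, usable directly	bzip2	.bz2	Yes	No changes; handled like plain text
LZO	No, must be installed	LZO	.lzo	Yes (once indexed)	An index must be built and the input format must be specified
Snappy	Yes, usable directly	Snappy	.snappy	No	No changes; handled like plain text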

2) Compression performance comparison

Algorithm	Original size	Compressed size	Compression speed	Decompression speed
gzip	8.3GB	1.8GB	17.5MB/s	58MB/s
bzip2	8.3GB	1.1GB	2.4MB/s	9.5MB/s
LZO	8.3GB	2.9GB	49.3MB/s	74.6MB/s
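
To make the table easier to compare, the sketch below derives two numbers per algorithm from the figures above: the fraction of the original size that remains, and a rough time to compress the whole file. It assumes 1 GB = 1024 MB and that the listed speed is measured against the uncompressed input; the table states neither, so both are assumptions. The class name `CompressionStats` is made up for illustration.

```java
public class CompressionStats {

    // Fraction of the original size that remains after compression.
    static double ratio(double originalGb, double compressedGb) {
        return compressedGb / originalGb;
    }

    // Seconds needed to process the original file at the given throughput.
    // Assumption: 1 GB = 1024 MB, throughput measured on uncompressed bytes.
    static double seconds(double originalGb, double mbPerSecond) {
        return originalGb * 1024.0 / mbPerSecond;
    }

    public static void main(String[] args) {
        // Figures copied from the performance table above.
        String[] names = {"gzip", "bzip2", "LZO"};
        double[] compressedGb = {1.8, 1.1, 2.9};
        double[] compressMbPerSec = {17.5, 2.4, 49.3};
        double originalGb = 8.3;

        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-6s remaining size %.1f%%, compression time about %.0f s%n",
                    names[i],
                    100 * ratio(originalGb, compressedGb[i]),
                    seconds(originalGb, compressMbPerSec[i]));
        }
    }
}
```

Under these assumptions bzip2 leaves the smallest file but takes by far the longest, while LZO is the fastest and leaves the largest file, which matches the pros and cons listed in section 3.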

snappy | A fast compressor/decompressor

Snappy is a compression/decompression library. It does not aim for maximum compression, or compatibility with any other compression library; instead, it aims for very high speeds and reasonable compression. For instance, compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger. On a single core of a Core i7 processor in 64-bit mode, Snappy compresses at about 250 MB/sec or more and decompresses at about 500 MB/sec or more.

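3. Choosing a compression format

When choosing a compression format, focus on three things: compression/decompression speed, compression ratio (size on disk after compression), and whether the compressed file can still be split.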
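
Splittability matters because it limits how many map tasks can read one file in parallel. The sketch below is a deliberately simplified estimate, not Hadoop's actual split logic (the real `FileInputFormat` also applies minimum/maximum split sizes and a slop factor for the last split). The 128 MB split size and the 1 GB file are assumptions chosen for illustration, and `SplitEstimate` is a hypothetical class name.

```java
public class SplitEstimate {

    // Rough number of map tasks for one input file; assumes fileBytes > 0.
    // Simplification: ignores Hadoop's min/max split size and slop factor.
    static long estimateSplits(long fileBytes, long splitBytes, boolean splittable) {
        if (!splittable) {
            return 1; // the whole file must be decompressed by a single task
        }
        return (fileBytes + splitBytes - 1) / splitBytes; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = 128 * mb; // assumed split size
        long fileSize = 1024 * mb; // hypothetical 1 GB compressed file

        System.out.println("gzip  (not splittable): " + estimateSplits(fileSize, splitSize, false));
        System.out.println("bzip2 (splittable):     " + estimateSplits(fileSize, splitSize, true));
    }
}
```

With these assumed sizes, the non-splittable file is handled by one task while the splittable one can be spread over eight, which is why a large gzip or Snappy input can become a bottleneck.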

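3.1 Gzip

Pros: relatively high compression ratio.

Cons: not splittable; average compression/decompression speed.

3.2 Bzip2

Pros: high compression ratio; splittable.

Cons: slow compression/decompression.

3.3 LZO

Pros: fairly fast compression/decompression; splittable.

Cons: mediocre compression ratio; splitting requires building an extra index.

3.4 Snappy

Pros: fast compression and decompression.

Cons: not splittable; mediocre compression ratio.

3.5 Where to compress

Compression can be enabled at any stage of a MapReduce job: on the input, on the mapper output, or on the reducer output.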

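4. Compression parameter configuration

1) To support multiple compression/decompression algorithms, Hadoop provides codecs (coder/decoders)

Format	Codec class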
DEFLATE	org.apache.hadoop.io.compress.DefaultCodec
gzip	org.apache.hadoop.io.compress.GzipCodec
bzip2	org.apache.hadoop.io.compress.BZip2Codec
LZO	com.hadoop.compression.lzo.LzopCodec
Snappy	org.apache.hadoop.io.compress.SnappyCodec
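
For input files, Hadoop picks the codec from the file extension. The sketch below only illustrates that idea with a plain map built from the extensions in section 2 and the codec classes listed above; it is not Hadoop's implementation (in Hadoop this lookup is done by `CompressionCodecFactory`), and `CodecLookup` is a hypothetical class name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CodecLookup {

    // Extension -> codec class name, copied from the tables in this article.
    static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        CODECS.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODECS.put(".lzo", "com.hadoop.compression.lzo.LzopCodec");
        CODECS.put(".snappy", "org.apache.hadoop.io.compress.SnappyCodec");
    }

    // Returns the codec class name for a file name, or null if no extension matches
    // (the file is then treated as uncompressed).
    static String codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODECS.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(codecFor("part-r-00000.bz2"));
        System.out.println(codecFor("words.txt"));
    }
}
```

This is why the comparison table says most formats need no program changes: as long as the extension matches a registered codec, the input is decompressed transparently.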

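2) To enable compression in Hadoop, configure the following parameters

Parameter	Default	Stage	Recommendation
io.compression.codecs (set in core-site.xml)	None; run hadoop checknative on the command line to check	Input compression	Hadoop uses the file extension to determine whether a codec is supported
mapreduce.map.output.compress (set in mapred-site.xml)	false	Mapper output	Set to true to enable compression
mapreduce.map.output.compress.codec (set in mapred-site.xml)	org.apache.hadoop.io.compress.DefaultCodec	Mapper output	In production, LZO or Snappy is commonly used to compress data at this stage
mapreduce.output.fileoutputformat.compress (set in mapred-site.xml)	false	Reducer output	Set to true to enable compression
mapreduce.output.fileoutputformat.compress.codec (set in mapred-site.xml)	org.apache.hadoop.io.compress.DefaultCodec	Reducer output	Use a standard tool or codec, such as gzip or bzip2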
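
As a configuration sketch, the four mapred-site.xml parameters from the table could be set as follows. The codec choices (Snappy for the mapper output, bzip2 for the reducer output) are only examples that follow the table's recommendations; they assume the corresponding codecs are available on the cluster.

```xml
<!-- mapred-site.xml: illustrative fragment, codec choices are assumptions -->
<configuration>
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.BZip2Codec</value>
  </property>
</configuration>
```

Setting these in the configuration file applies them to every job by default; the examples in section 5 set the same properties per job in the driver code instead.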

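5. Hands-on compression examples

5.1 Compressing the map output

Even if the input and output files of your MapReduce job are uncompressed, you can still compress the intermediate output of the map tasks. That output is written to disk and transferred over the network to the reduce nodes, so compressing it can noticeably improve performance. Only two properties need to be set; the code below shows how.

1) The Hadoop build used here supports the following codecs: BZip2Codec and DefaultCodec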

package yasuo;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {

        // 1. Create the job
        Configuration conf = new Configuration();
        // Enable compression of the map output
        conf.setBoolean("mapreduce.map.output.compress", true);

        // Choose the codec used for the map output
        conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf);

        // 2. Set the jar by class
        job.setJarByClass(WordCountDriver.class);

        // 3. Attach the mapper and reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // 4. Set the key/value types of the map output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5. Set the key/value types of the final output
        //    (IntWritable, matching what WordCountReducer emits)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("E:\\IDEA\\hadoop\\input\\inputword"));
        FileOutputFormat.setOutputPath(job, new Path("E:\\IDEA\\hadoop\\output\\inputword34"));

        // 7. Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);

        System.exit(result ? 0 : 1);
    }
}

2) The Mapper stays unchanged

package yasuo;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/*
 * KEYIN:    key type of the map input: LongWritable
 * VALUEIN:  value type of the map input: Text
 * KEYOUT:   key type of the map output: Text
 * VALUEOUT: value type of the map output: IntWritable
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused across calls to avoid allocating new objects per record
    private Text outK = new Text();
    private IntWritable outV = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        // 1. Read one line, e.g. "ss ss"
        String line = value.toString();

        // 2. Split it into words, e.g. "ss" and "ss"
        String[] words = line.split(" ");

        // 3. Emit (word, 1) for every word
        for (String word : words) {
            outK.set(word);
            context.write(outK, outV);
        }
    }
}

3) The Reducer stays unchanged

package yasuo;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;


/*
 * KEYIN:    key type of the reduce input: Text
 * VALUEIN:  value type of the reduce input: IntWritable
 * KEYOUT:   key type of the reduce output: Text
 * VALUEOUT: value type of the reduce output: IntWritable
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable outV = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {

        // Sum the counts for this word, e.g. ss -> (1, 1)
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        outV.set(sum);

        // Emit (word, total)
        context.write(key, outV);
    }
}

5.2 Compressing the reduce output

This example builds on the WordCount case.

1) Modify the driver

package yasuo;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {

        // 1. Create the job
        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf);

        // 2. Set the jar by class
        job.setJarByClass(WordCountDriver.class);

        // 3. Attach the mapper and reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // 4. Set the key/value types of the map output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5. Set the key/value types of the final output
        //    (IntWritable, matching what WordCountReducer emits)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("E:\\IDEA\\hadoop\\input\\inputword"));
        FileOutputFormat.setOutputPath(job, new Path("E:\\IDEA\\hadoop\\output\\inputword34"));

        // Enable compression of the reduce output
        FileOutputFormat.setCompressOutput(job, true);

        // Choose the codec; the alternatives below also need their classes
        // imported from org.apache.hadoop.io.compress
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
//	    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
//	    FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);

        // 7. Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);

        System.exit(result ? 0 : 1);

    }
}

2) The Mapper and Reducer stay unchanged (see 5.1)
