MapReduce--Shuffle in Detail, Using Compression, Data Skew Solutions, and Parameter Tuning

XK&RM

Category: Hadoop. Tags: hadoop, mapreduce, shuffle

Article link: https://blog.csdn.net/qq_41301707/article/details/111171585

Hadoop official documentation: MapReduce Tutorial

1 MapReduce--Shuffle in Detail

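The MapReduce Shuffle consists of a Map-side Shuffle and a Reduce-side Shuffle.
The Map-side Shuffle proceeds as follows:
- The output of the Map function is the input of the Map-side Shuffle.
- The data is first written to a circular buffer in memory. Its size is set by mapreduce.task.io.sort.mb, which defaults to 100 MB.
- Spill: when the data in the circular buffer reaches a threshold, a separate thread starts writing it to disk. The threshold is set by mapreduce.map.sort.spill.percent, which defaults to 0.80, so with the default buffer size a spill starts once the buffer holds 80 MB. Before the data is written to disk it goes through Partition, Sort, and Combine.
- Partition: before a spill is written to disk, the records are partitioned. The default is HashPartitioner, which takes the hash of the key modulo the number of ReduceTasks, so records with the same key always land in the same partition. For specific business scenarios you can write a custom Partitioner to override this rule.
- Sort: the records are sorted in ascending order, first by partition and then by key within each partition. You can define your own ordering by overriding the compareTo method of the key type.
- Combine: local aggregation. Pre-aggregating the data on the Map side effectively reduces the IO spent moving data from Map to Reduce. Some scenarios are not suited to pre-aggregation; the classic one is computing an average, because the average of partial averages is generally not the overall average. A combiner is set in code with job.setCombinerClass(MyCombine.class); see CombinerDirver for a complete example.
- Merge: when the data volume is large, one map task spills the circular buffer to disk several times, and each spill produces one file, so a single Map ends up with multiple files on disk. Merge combines these spill files so that each Map finally produces a single file, and the temporary spill files are deleted. To reduce the number of files to merge, increase mapreduce.task.io.sort.mb: a larger circular buffer means fewer spills and therefore fewer files to merge. The end of this stage marks the end of the Map-side Shuffle; the Reduce-side Shuffle follows.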
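The Partition, Sort, and Combine steps above can be sketched in plain Java without any Hadoop dependency. This is only an illustration under stated assumptions: the class SpillSketch and its methods are hypothetical names, the keys are plain String values (Hadoop's Text type computes its hash differently), and the partition formula follows the hash-modulo rule described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

class SpillSketch {

    // Hash-modulo partitioning: mask off the sign bit, then take the remainder.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Partition -> Sort -> Combine for one buffer's worth of (word, 1) records.
    static List<TreeMap<String, Integer>> spill(List<String> words, int numReduceTasks) {
        List<TreeMap<String, Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            // A TreeMap keeps the keys of one partition in ascending order
            partitions.add(new TreeMap<>());
        }
        for (String word : words) {
            TreeMap<String, Integer> partition = partitions.get(partitionFor(word, numReduceTasks));
            // Combine: pre-aggregate the count for each key before it leaves the map side
            partition.merge(word, 1, Integer::sum);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("spark", "hadoop", "spark", "flink", "hadoop", "spark");
        List<TreeMap<String, Integer>> out = spill(words, 2);
        for (int i = 0; i < out.size(); i++) {
            System.out.println("partition " + i + ": " + out.get(i));
        }
    }
}
```

Summing counts works as a combiner because addition can be applied in any grouping. An average cannot be combined this way unless each partial result carries both a sum and a count.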
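The Reduce-side Shuffle proceeds as follows:
- Loading data into a buffer: the Reduce side also loads the data it receives into memory first. The size of this buffer is more flexible than the Map-side circular buffer, because it is derived from the JVM heap size of the ReduceTask.
- Spill: when the data in the buffer reaches a threshold, it is written to disk.
- Merge: the files that Spill wrote to disk are merged into one file and the temporary files are deleted; the Reduce operation then runs on the merged data.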
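Both Merge steps combine several already-sorted spill files into one sorted output. A minimal sketch of that idea is a k-way merge over sorted runs with a priority queue. The class MergeSketch is a hypothetical name, the runs are in-memory lists rather than files, and this is not Hadoop's actual merger, only the general technique.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class MergeSketch {

    // One cursor per sorted run: the current head element plus the rest of the run.
    private static final class Cursor {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    // Merges sorted runs into one sorted list by repeatedly taking the smallest head.
    static List<String> merge(List<List<String>> sortedRuns) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (List<String> run : sortedRuns) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run.iterator()));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            merged.add(smallest.head);
            if (smallest.rest.hasNext()) {
                // Advance the cursor and put it back so its next element competes again
                smallest.head = smallest.rest.next();
                heap.add(smallest);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> spills = Arrays.asList(
                Arrays.asList("flink", "spark"),
                Arrays.asList("hadoop", "hbase"),
                Arrays.asList("kafka"));
        System.out.println(merge(spills));
    }
}
```

Because every run is already sorted, the merge only ever compares the current head of each run, which is why sorting at spill time makes the later merge cheap.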

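2 MapReduce--Using Compression

2.1 What is compression

Compression is a mechanism that reduces the size of computer files by means of a specific algorithm. It is especially convenient for network users: by reducing the total number of bytes in a file, it lets the file transfer faster over a slow connection, and it also reduces the disk space the file occupies.

2.2 Compression formats commonly used with Hadoop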

gzip
lzo
snappy
bzip2

2.3 Compressing files on the file system

2.3.1 Requirement

Compress and decompress compression.data

2.3.2 Code

2.3.2.1 Mock data code

package com.xk.bigata.hadoop.mapreduce.compression.mock;

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Random;

public class MockData {

    public static void main(String[] args) {
        String[] words = new String[]{"hadoop", "flink", "hbase", "spark", "kafka"};
        // Reuse one Random instead of creating a new instance for every word
        Random random = new Random();
        // try-with-resources flushes and closes the writer, even if writing fails
        try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                new FileOutputStream("mapreduce-basic/data/compression.data"), StandardCharsets.UTF_8))) {
            for (int i = 0; i < 1000000; i++) {
                // Each line holds 30 random words, each followed by a comma
                StringBuilder line = new StringBuilder();
                for (int j = 0; j < 30; j++) {
                    line.append(words[random.nextInt(words.length)]).append(",");
                }
                line.append("\n");
                writer.write(line.toString());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

MockData Code

The generated compression.data file is 187 MB.

2.3.2.2 CompressUtils Code

package com.xk.bigata.hadoop.mapreduce.compression;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionInputStream;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.*;

public class CompressUtils {

    public static void main(String[] args) throws Exception {
        compress("mapreduce-basic/data/compression.data", "org.apache.hadoop.io.compress.BZip2Codec");
        decompression("mapreduce-basic/data/compression.data.bz2");
    }

    /**
     * Compress a file
     *
     * @param fileName file name
     * @param codeC    fully qualified class name of the compression codec
     */
    public static void compress(String fileName, String codeC) throws Exception {
        Class<?> codecClass = Class.forName(codeC);
        CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, new Configuration());
        // try-with-resources closes all three streams even if the copy fails
        try (FileInputStream fis = new FileInputStream(new File(fileName));
             FileOutputStream fos = new FileOutputStream(new File(fileName + codec.getDefaultExtension()));
             CompressionOutputStream cos = codec.createOutputStream(fos)) {
            IOUtils.copyBytes(fis, cos, 1024 * 1024 * 5);
        }
    }

    /**
     * Decompress a file
     *
     * @param fileName name of the compressed file; the codec is inferred from its extension
     * @throws Exception if reading or writing fails
     */
    public static void decompression(String fileName) throws Exception {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        CompressionCodec codec = factory.getCodec(new Path(fileName));
        if (null == codec) {
            // codec is null here, so report the file name instead of dereferencing it
            System.out.println("No codec found for: " + fileName);
            return;
        }
        // The decompressed copy is written next to the input with a ".bak" suffix
        try (CompressionInputStream cis = codec.createInputStream(new FileInputStream(new File(fileName)));
             FileOutputStream fos = new FileOutputStream(new File(fileName + ".bak"))) {
            IOUtils.copyBytes(cis, fos, 1024 * 1024 * 5);
        }
    }
}

CompressUtils Code

File size before compression: 187 MB
File size after compression: 10.7 MB
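The large size reduction comes from how repetitive the mock data is: only five distinct words. The round trip can be tried without a Hadoop installation using the JDK's own gzip streams. Note that this sketch swaps in java.util.zip instead of the Hadoop BZip2Codec used above, so the exact ratio will differ; GzipRoundTripSketch and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipRoundTripSketch {

    // Compresses a byte array in memory with gzip.
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Restores the original bytes from gzip-compressed data.
    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = gzip.read(buffer)) != -1) {
                bytes.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Builds lines shaped like the mock data: 30 random words per line.
    static byte[] mockData(int lines, long seed) {
        String[] words = {"hadoop", "flink", "hbase", "spark", "kafka"};
        Random random = new Random(seed);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            for (int j = 0; j < 30; j++) {
                sb.append(words[random.nextInt(words.length)]).append(",");
            }
            sb.append("\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = mockData(1000, 42L);
        byte[] packed = compress(raw);
        System.out.println("raw bytes: " + raw.length + ", compressed bytes: " + packed.length);
        System.out.println("round trip ok: " + java.util.Arrays.equals(raw, decompress(packed)));
    }
}
```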

2.4 Using compression in MapReduce

2.4.1 Requirement

Use MapReduce to read compressed data and write the final output as compressed files

2.4.2 Code

2.4.2.1 CompressionDirver Code

package com.xk.bigata.hadoop.mapreduce.compression;

import com.xk.bigata.hadoop.utils.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class CompressionDirver {

    public static void main(String[] args) throws Exception {
        String input = "mapreduce-basic/data/wc.txt.bz2";
        String output = "mapreduce-basic/out";

        // 1 Create the MapReduce job
        Configuration conf = new Configuration();
        // Enable compression of the Map output
        conf.setBoolean("mapreduce.map.output.compress", true);
        // Set the compression codec for the Map output
        conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);
        // Enable compression of the Reduce (job) output
        conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);
        // Set the compression codec for the Reduce (job) output
        conf.setClass("mapreduce.output.fileoutputformat.compress.codec", BZip2Codec.class, CompressionCodec.class);
        Job job = Job.getInstance(conf);

        // Delete the output path
        FileUtils.deleteFile(job.getConfiguration(), output);

        // 2 Set the main class
        job.setJarByClass(CompressionDirver.class);

        // 3 Set the Mapper and Reducer classes
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // 4 Set the KEY and VALUE types of the Map output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5 Set the KEY and VALUE types of the Reduce output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));

        // 7 Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }

    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] spilts = value.toString().split(",");
            for (String word : spilts) {
                context.write(new Text(word), ONE);
            }
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable value : values) {
                count += value.get();
            }
            context.write(key, new IntWritable(count));
        }
    }
}

CompressionDirver Code

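2.5 MapReduce compression configuration parameters

You can choose a suitable compression format and move the parameters set in the code above into the configuration files.

2.5.1 Configuring core-site.xml

The following command shows which compression formats the Hadoop installation in the cluster currently supports natively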

[root@bigdatatest01 ~]# hadoop checknative
20/12/15 09:40:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Native library checking:
hadoop:  false 
zlib:    false 
zstd  :  false 
snappy:  false 
lz4:     false 
bzip2:   false 
openssl: false 
ISA-L:   false 
20/12/15 09:40:09 INFO util.ExitUtil: Exiting with status 1: ExitException

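If none of the compression formats are supported, you need to download the Hadoop source code, recompile it with native support, and then deploy the resulting native libraries into your own environment.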

<property>
	<name>io.compression.codecs</name>
	<value>org.apache.hadoop.io.compress.GzipCodec,
		  org.apache.hadoop.io.compress.DefaultCodec,
		  org.apache.hadoop.io.compress.BZip2Codec,
		  org.apache.hadoop.io.compress.SnappyCodec
	</value>
</property>

2.5.2 Configuring mapred-site.xml

<property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
</property>

<property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.BZip2Codec</value>
</property>

It is enough to configure compression for the Reduce-side output files only.

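3 MapReduce--Data Skew Solutions

3.1 Requirement

How should data skew be handled in MapReduce?

3.2 Design

The root cause of data skew is an uneven distribution of keys during the Shuffle.
We need to scatter the concentrated keys before the Shuffle, which effectively avoids the skew.
Split one MapReduce job into two MapReduce jobs, as follows:
- The first MapReduce job prefixes each key with a random number below ten, then runs the computation and writes out the partial results.
- The second MapReduce job takes the output of the first job as its input, strips the random prefix from each key, and then shuffles and aggregates again.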
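The two stages above can be simulated in plain Java to check that salting the keys does not change the final counts. This is only a sketch of the idea: SaltedCountSketch, stage1, and stage2 are hypothetical names, and in-memory maps stand in for the two MapReduce jobs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

class SaltedCountSketch {

    // Stage 1: prefix each key with a random salt in [0, salts) and count per salted key.
    static Map<String, Integer> stage1(List<String> words, int salts, Random random) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : words) {
            partial.merge(random.nextInt(salts) + "-" + word, 1, Integer::sum);
        }
        return partial;
    }

    // Stage 2: strip the salt (everything up to the first '-') and sum the partial counts.
    static Map<String, Integer> stage2(Map<String, Integer> partial) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : partial.entrySet()) {
            // A limit of 2 keeps words that themselves contain '-' intact
            String word = entry.getKey().split("-", 2)[1];
            total.merge(word, entry.getValue(), Integer::sum);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "hadoop", "hadoop", "hadoop", "hadoop", "hadoop", "spark", "spark", "flink");
        Map<String, Integer> partial = stage1(words, 10, new Random());
        System.out.println("stage 1 (salted keys): " + partial);
        System.out.println("stage 2 (final counts): " + stage2(partial));
    }
}
```

The hot key is split across up to ten salted keys in stage 1, so no single reducer has to process all of its records; stage 2 then only sees at most ten partial counts per original key.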

3.3 Code

3.3.1 DataSkewDriver Code

package com.xk.bigata.hadoop.mapreduce.dataskew;

import com.xk.bigata.hadoop.utils.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.Random;

public class DataSkewDriver {

    public static void main(String[] args) throws Exception {
        String input = "mapreduce-basic/data/wc.txt";
        String midput = "mapreduce-basic/out/midput";
        String output = "mapreduce-basic/out/output";

        // Job 1: count words under salted keys

        // 1 Create the MapReduce job
        Configuration conf = new Configuration();
        Job job1 = Job.getInstance(conf);

        // Delete the intermediate and final output paths
        FileUtils.deleteFile(job1.getConfiguration(), midput);
        FileUtils.deleteFile(job1.getConfiguration(), output);

        // 2 Set the main class
        job1.setJarByClass(DataSkewDriver.class);

        // 3 Set the Mapper and Reducer classes
        job1.setMapperClass(MyMapper1.class);
        job1.setReducerClass(MyReducer.class);

        // 4 Set the KEY and VALUE types of the Map output
        job1.setMapOutputKeyClass(Text.class);
        job1.setMapOutputValueClass(IntWritable.class);

        // 5 Set the KEY and VALUE types of the Reduce output
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);

        // 6 Set the input and output paths
        FileInputFormat.setInputPaths(job1, new Path(input));
        FileOutputFormat.setOutputPath(job1, new Path(midput));

        // Wrap job1 in a ControlledJob so the controller can manage it
        ControlledJob ctrlJob1 = new ControlledJob(conf);
        ctrlJob1.setJob(job1);

        // Job 2: strip the salt and aggregate the partial counts

        // 1 Create the MapReduce job
        Job job2 = Job.getInstance(conf);

        // 2 Set the main class
        job2.setJarByClass(DataSkewDriver.class);

        // 3 Set the Mapper and Reducer classes
        job2.setMapperClass(MyMapper2.class);
        job2.setReducerClass(MyReducer.class);

        // 4 Set the KEY and VALUE types of the Map output
        job2.setMapOutputKeyClass(Text.class);
        job2.setMapOutputValueClass(IntWritable.class);

        // 5 Set the KEY and VALUE types of the Reduce output
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(IntWritable.class);

        // 6 Set the input and output paths
        FileInputFormat.setInputPaths(job2, new Path(midput));
        FileOutputFormat.setOutputPath(job2, new Path(output));

        // Wrap job2 in a ControlledJob so the controller can manage it
        ControlledJob ctrlJob2 = new ControlledJob(conf);
        ctrlJob2.setJob(job2);

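        // Set the dependency between the jobs: job2's input depends on job1's output
        ctrlJob2.addDependingJob(ctrlJob1);

        // Create the main controller that manages job1 and job2
        JobControl jobCtrl = new JobControl("myCtrl");
        // Add both jobs to the JobControl
        jobCtrl.addJob(ctrlJob1);
        jobCtrl.addJob(ctrlJob2);

        // JobControl is a Runnable, so it must be started in its own thread
        Thread thread = new Thread(jobCtrl);
        thread.start();
        // Poll with a short sleep instead of spinning the CPU in a tight loop
        while (!jobCtrl.allFinished()) {
            Thread.sleep(500);
        }
        System.out.println(jobCtrl.getSuccessfulJobList());
        jobCtrl.stop();
        // Exit with a non-zero status if any job failed
        System.exit(jobCtrl.getFailedJobList().isEmpty() ? 0 : 1);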
    }

    public static class MyMapper1 extends Mapper<LongWritable, Text, Text, IntWritable> {
        IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            Random random = new Random();
            String[] spilts = value.toString().split(",");
            for (String word : spilts) {
                int num = random.nextInt(10);
                context.write(new Text(num + "-" + word), ONE);
            }
        }
    }

    public static class MyMapper2 extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Each input line is "<salt>-<word>\t<count>" as written by job 1
            String[] spilts = value.toString().split("\t");
            // A limit of 2 splits only at the first '-', so words containing '-' stay intact
            String[] wordSpilt = spilts[0].split("-", 2);
            context.write(new Text(wordSpilt[1]), new IntWritable(Integer.parseInt(spilts[1])));
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable value : values) {
                count += value.get();
            }
            context.write(key, new IntWritable(count));
        }
    }
}

DataSkewDriver Code

3.4 Results

3.4.1 Intermediate result

0-hadoop	2
0-spark	1
1-hadoop	1
1-hbase	1
1-spark	2
2-hadoop	1
2-hbase	1
3-hadoop	1
3-spark	2
5-flink	3
5-hadoop	1
6-spark	1
7-flink	2
7-hbase	1
7-spark	1
8-hadoop	1

The keys have been scattered, which effectively avoids data skew.

3.4.2 Final result

flink	5
hadoop	7
hbase	3
spark	7

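4 MapReduce--Parameter Tuning

Parameters changed in the server-side configuration files only take effect after a restart.

4.1 Parameters that take effect when set in the user's own MapReduce application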

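mapreduce.map.memory.mb

The upper limit of memory (in MB) that one MapTask may use; the default is 1024.
If a MapTask actually uses more than this value, it is forcibly killed.

mapreduce.reduce.memory.mb

The upper limit of memory (in MB) that one ReduceTask may use; the default is 1024.
If a ReduceTask actually uses more than this value, it is forcibly killed.

mapreduce.map.cpu.vcores

The maximum number of CPU cores each MapTask may use. Default: 1

mapreduce.reduce.cpu.vcores

The maximum number of CPU cores each ReduceTask may use. Default: 1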

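mapreduce.map.java.opts

JVM options for the MapTask. You can set the default Java heap size and other options here,
for example: "-Xmx2048m -verbose:gc -Xloggc:/tmp/@taskid@.gc". Default: ""

mapreduce.reduce.java.opts

JVM options for the ReduceTask. You can set the default Java heap size and other options here.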

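mapreduce.task.io.sort.mb=100

Size of the Shuffle circular buffer; the default is 100 MB.

mapreduce.map.sort.spill.percent=0.8

Spill threshold of the circular buffer; the default is 80%.

mapreduce.reduce.shuffle.parallelcopies

Number of threads a reducer uses to copy map output; the default is 5.

mapreduce.reduce.shuffle.input.buffer.percent

The fraction of the reduce task's heap used to buffer map output while it is being copied; the default is 0.70.
Raising this value moderately reduces the amount of map data spilled to disk and can improve system performance.

mapreduce.reduce.shuffle.merge.percent

During the reduce-side Shuffle, the threshold at which merging the buffered output and spilling it to disk starts; the default is 0.66.
Where memory allows, raising this ratio moderately reduces the number of disk spills and improves system performance.
Use it together with mapreduce.reduce.shuffle.input.buffer.percent.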
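The buffer-related parameters above multiply together, which is easier to see with concrete numbers. The following sketch only does the arithmetic; ShuffleBufferMath is a hypothetical helper, the 1024 MB reducer heap is an assumed example value, and the real framework works in bytes and may round differently.

```java
class ShuffleBufferMath {

    // Map side: the buffer starts spilling once it holds sortMb * spillPercent megabytes.
    static double spillThresholdMb(double sortMb, double spillPercent) {
        return sortMb * spillPercent;
    }

    // Reduce side: share of the reducer heap that may hold fetched map output.
    static double shuffleBufferMb(double heapMb, double inputBufferPercent) {
        return heapMb * inputBufferPercent;
    }

    // Reduce side: merging and spilling start once the shuffle buffer is this full.
    static double mergeThresholdMb(double shuffleBufferMb, double mergePercent) {
        return shuffleBufferMb * mergePercent;
    }

    public static void main(String[] args) {
        double heapMb = 1024; // assumed reducer heap, for example -Xmx1024m
        double bufferMb = shuffleBufferMb(heapMb, 0.70);
        System.out.printf("map spill threshold:   %.1f MB%n", spillThresholdMb(100, 0.80));
        System.out.printf("reduce shuffle buffer: %.1f MB%n", bufferMb);
        System.out.printf("reduce merge trigger:  %.1f MB%n", mergeThresholdMb(bufferMb, 0.66));
    }
}
```

This also shows why the two reduce-side percentages are tuned together: the merge trigger is a fraction of a fraction of the heap.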

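4.2 Parameters that must be set in the server configuration files before YARN starts

yarn.scheduler.minimum-allocation-mb=1024

Minimum memory (in MB) allocated to an application container

yarn.scheduler.maximum-allocation-mb=8192

Maximum memory (in MB) allocated to an application container

yarn.scheduler.minimum-allocation-vcores=1

Minimum number of vcores allocated to an application container

yarn.scheduler.maximum-allocation-vcores=32

Maximum number of vcores allocated to an application container

yarn.nodemanager.resource.memory-mb=8192

Amount of memory (in MB) on the node that the NodeManager can hand out to containers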
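A rough way to reason about these values is to estimate how large a container a request turns into and how many such containers fit on one node. The sketch below assumes that a memory request is rounded up to a multiple of the minimum allocation and that a request above the maximum is rejected; the exact behavior depends on the scheduler and the Hadoop version, so treat ContainerMath as a hypothetical illustration only.

```java
class ContainerMath {

    // Assumed scheduler behavior: round the request up to a multiple of the minimum allocation.
    static int normalizeMb(int requestedMb, int minAllocMb, int maxAllocMb) {
        if (requestedMb > maxAllocMb) {
            throw new IllegalArgumentException(
                    "request " + requestedMb + " MB exceeds the maximum allocation " + maxAllocMb + " MB");
        }
        int rounded = ((requestedMb + minAllocMb - 1) / minAllocMb) * minAllocMb;
        // Never go below the minimum or above the maximum allocation
        return Math.min(Math.max(rounded, minAllocMb), maxAllocMb);
    }

    // Memory-only estimate of how many containers of a given size fit on one node.
    static int containersPerNode(int nodeMemoryMb, int containerMb) {
        return nodeMemoryMb / containerMb;
    }

    public static void main(String[] args) {
        int containerMb = normalizeMb(1500, 1024, 8192);
        System.out.println("container size: " + containerMb + " MB");
        System.out.println("containers per node: " + containersPerNode(8192, containerMb));
    }
}
```

Under these assumptions a 1500 MB request occupies 2048 MB, so a node offering 8192 MB runs at most four such containers; the estimate ignores vcores.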

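4.3 Fault-tolerance parameters

mapreduce.map.maxattempts=4

Maximum number of attempts for each MapTask; once the number of attempts exceeds this value, the MapTask is considered failed.

mapreduce.reduce.maxattempts=4

Maximum number of attempts for each ReduceTask; once the number of attempts exceeds this value, the ReduceTask is considered failed.

mapreduce.task.timeout=600000

Task timeout in milliseconds, a parameter that often needs to be set. It means:
if a task makes no progress for this long, that is, it neither reads new input nor writes any output, the task is considered blocked. It may be stuck, possibly forever, so this timeout is enforced to keep a user program from blocking indefinitely without exiting. Older versions default to 300000. If your program takes a long time to process each input record (for example, it accesses a database or pulls data over the network), increase this value.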
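The timeout rule described above boils down to comparing the time since the last sign of progress with the configured limit. A minimal sketch of that check follows; TaskTimeoutSketch is a hypothetical class, not the framework's implementation, and treating a timeout of 0 as "never time out" is a choice made for this sketch.

```java
class TaskTimeoutSketch {

    private final long timeoutMillis;
    private long lastProgressMillis;

    TaskTimeoutSketch(long timeoutMillis, long startMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastProgressMillis = startMillis;
    }

    // Called whenever the task reads input, writes output, or otherwise reports progress.
    void reportProgress(long nowMillis) {
        lastProgressMillis = nowMillis;
    }

    // True once no progress has been seen for longer than the timeout (0 disables the check).
    boolean isTimedOut(long nowMillis) {
        return timeoutMillis > 0 && nowMillis - lastProgressMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        TaskTimeoutSketch task = new TaskTimeoutSketch(600000L, 0L);
        System.out.println("timed out after 5 min of silence:  " + task.isTimedOut(300000L));
        System.out.println("timed out after 11 min of silence: " + task.isTimedOut(660000L));
        task.reportProgress(660000L);
        System.out.println("timed out right after progress:    " + task.isTimedOut(660001L));
    }
}
```

This is why a task that spends a long time on each record benefits either from a larger timeout or from reporting progress while it works.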