Hadoop MapReduce Data Skew: Causes and Solutions

jiayeliDoCn

Tags: hadoop, big data

Article link: https://blog.csdn.net/weixin_46661903/article/details/108795143

1. What is data skew

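Data skew, as the name suggests, means that data is distributed unevenly. It is a problem of how massive data sets are allocated across a distributed system or cluster. Imagine your mom buys a hundred apples, gives eighty to your little brother and twenty to you, and says she will not buy more until both of you have finished yours (you both love apples). That allocation is obviously unreasonable. If you and your brother eat the same number of apples per day, you finish yours first and then have to wait for him to finish all of his before the next batch arrives, and the wait is agonizing. Worse, your brother might get carried away, eat twenty in one day, and get a stomachache, so you have to wait until he recovers and finishes his apples. This will surely breed resentment between brothers. ~~This is the famous apple skew.~~ In the big data industry, the data being processed is often at the TB or PB scale and needs a cluster of machines to process. If the allocation is unreasonable, the efficiency of the cluster's jobs suffers badly.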

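So data skew is the problem where, when a data processing job assigns its work, machines with the same processing resources receive unequal amounts of data. Because the job only finishes when its slowest task finishes, this drags down the processing efficiency of the whole cluster.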

2. Why Hadoop MapReduce produces data skew

MapReduce data processing flow

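Data skew comes from how data is distributed. In MapReduce, data is distributed at three points: input splitting, partitioning, and the fetching of partition data by the reducers. Splits are computed from the number and size of the input files, so they are generally not skewed. For partitioning, Hadoop by default assigns the partition number as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, and each reducer later fetches its data by partition number. So if the keys' hash codes are unevenly distributed, the partition numbers are unevenly assigned, and the data becomes skewed when it is merged by partition number. Since all records with the same key always land in the same partition, a few very frequent keys are enough to cause this.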
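The partition formula above can be tried out without a cluster. The following is a minimal sketch in plain Java, not Hadoop code: the class name `PartitionDemo` and the sample keys are made up for illustration, and `String.hashCode()` stands in for the hash of Hadoop's `Text` type, which is computed differently. It only shows how one frequent key overloads a single partition.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; it has no Hadoop dependency.
final class PartitionDemo {

    // Same arithmetic as the default partition formula quoted in the text.
    // Assumption: String.hashCode() is used here instead of Text.hashCode().
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Counts how many records end up in each partition.
    static int[] recordsPerPartition(List<String> keys, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (String key : keys) {
            counts[partitionFor(key, numReduceTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up input: the key "a" appears 8 times out of 10 records.
        List<String> keys = Arrays.asList("a", "a", "a", "a", "a", "a", "a", "a", "b", "c");
        // Prints [1, 9]: partition 1 receives "a" and "c", partition 0 only "b".
        System.out.println(Arrays.toString(recordsPerPartition(keys, 2)));
    }
}
```

With two partitions, nine of the ten records go to the same one, so that reducer does almost all of the work.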

3. Solution

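Hadoop's default partitioning scheme partitions by the key's hashCode, so data skew is mainly the key's fault. We can rename the keys in the mapper stage (partition numbers have not been assigned yet at that point), for example by appending a random suffix. As long as the renamed keys are evenly distributed, no data skew occurs. The drawback is that one more pass over the data is needed to restore the original key names and merge the partial results.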
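Before looking at a Hadoop job, the rename-then-restore idea can be sketched in plain Java. This is an illustrative simulation under stated assumptions, not the article's implementation: the class `SaltDemo`, its method names, and the sample words are hypothetical, and the two "stages" are ordinary method calls rather than MapReduce jobs.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical demo class simulating the two passes in memory.
final class SaltDemo {

    // Renames a key to key-N, with N drawn uniformly from [0, buckets).
    static String salt(String key, int buckets, Random random) {
        return key + "-" + random.nextInt(buckets);
    }

    // Restores the original key by cutting at the last '-'.
    static String unsalt(String saltedKey) {
        int idx = saltedKey.lastIndexOf('-');
        return idx < 0 ? saltedKey : saltedKey.substring(0, idx);
    }

    // Pass 1: count words under their renamed keys.
    static Map<String, Integer> countSalted(List<String> words, int buckets, Random random) {
        Map<String, Integer> partial = new HashMap<>();
        for (String word : words) {
            partial.merge(salt(word, buckets, random), 1, Integer::sum);
        }
        return partial;
    }

    // Pass 2: strip the suffix and add up the partial counts.
    static Map<String, Integer> mergePartials(Map<String, Integer> partial) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map.Entry<String, Integer> e : partial.entrySet()) {
            totals.merge(unsalt(e.getKey()), e.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "a", "a", "a", "a", "a", "a", "a", "b", "c");
        Map<String, Integer> partial = countSalted(words, 4, new Random());
        System.out.println("partial counts: " + partial);
        System.out.println("final counts:   " + mergePartials(partial));
    }
}
```

The partial counts differ from run to run because the suffix is random, but the final counts always match a plain word count. The price is the second pass, as noted above.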

The complete code is as follows:

package com.spj.hadoopLean.MapReduiceDemo.skew;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

import java.io.IOException;
import java.util.Random;

public class Driver {

   /* 
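     * First MapReduce job: counts words under renamed keys (word-N).
     * To run it, uncomment this main method and comment out the one below.
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {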

        Job job = Job.getInstance(new Configuration(), "skewMR");
        job.setJarByClass(Driver.class);
        job.setMapperClass(SkewMapper.class);
        job.setReducerClass(SkewReduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
//        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setNumReduceTasks(2);
        FileInputFormat.setInputPaths(job, new Path("src\\main\\java\\com\\spj\\hadoopLean\\MapReduiceDemo\\datas\\skew\\input"));
        FileOutputFormat.setOutputPath(job, new Path("src\\main\\java\\com\\spj\\hadoopLean\\MapReduiceDemo\\datas\\skew\\outputSkew"));
        job.waitForCompletion(true);

    }*/

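    // Second MapReduce job: reads the first job's output, strips the suffix from each key, and sums the partial counts.
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {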

        Job job = Job.getInstance(new Configuration(), "skew1MR");
        job.setJarByClass(Driver.class);
        job.setMapperClass(Skew1Mapper.class);
        job.setReducerClass(Skew1Reduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
//        job.setOutputFormatClass(SequenceFileOutputFormat.class);
//        job.setNumReduceTasks(2);
        // This path is expected to contain the output of the first job.
        String filePath = "F:\\j2eeProjecet\\hadoopLean\\src\\main\\java\\com\\spj\\hadoopLean\\MapReduiceDemo\\datas\\skew\\input\\skewResult";
        FileInputFormat.setInputPaths(job, new Path(filePath));
        FileOutputFormat.setOutputPath(job, new Path("src\\main\\java\\com\\spj\\hadoopLean\\MapReduiceDemo\\datas\\skew\\output1"));
        job.waitForCompletion(true);

    }

    static class SkewMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private int tasks = 1;
        private final Random r = new Random();
        private final Text outKey = new Text();
        private final IntWritable one = new IntWritable(1);

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Guard against 0 reduce tasks (map-only job): Random.nextInt requires a positive bound.
            tasks = Math.max(1, context.getNumReduceTasks());
        }

        /**
         * Works around hash-induced data skew by appending a random suffix to each key.
         * By default Hadoop chooses the partition number as
         * (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
         */
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] split = value.toString().split("\\s+");

            for (String s : split) {
                if (s.isEmpty()) {
                    continue; // leading whitespace produces an empty first token
                }
                // Rename the key to word-N with N in [0, tasks), so one hot word is spread over several keys.
                outKey.set(s + "-" + r.nextInt(tasks));
                context.write(outKey, one);
            }
        }
    }


    static class SkewReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Use a local counter: a field would keep accumulating across keys.
            int count = 0;
            for (IntWritable value : values) {
                count += value.get();
            }
            result.set(count);
            context.write(key, result);
        }
    }


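    static class Skew1Mapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        private final Text outKey = new Text();
        private final LongWritable outValue = new LongWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Each input line is a line of the first job's output: "word-N<whitespace>count".
            String[] fields = value.toString().trim().split("\\s+");
            if (fields.length != 2) {
                return;
            }
            // Cut at the last '-', so words that themselves contain '-' are restored correctly.
            int idx = fields[0].lastIndexOf('-');
            if (idx <= 0) {
                return;
            }
            try {
                outValue.set(Long.parseLong(fields[1]));
            } catch (NumberFormatException e) {
                return; // skip lines whose count is not a number
            }
            outKey.set(fields[0].substring(0, idx));
            context.write(outKey, outValue);
        }
    }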

    static class Skew1Reduce extends Reducer<Text, LongWritable, Text, IntWritable> {
        private final IntWritable v = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            // Reset the sum for every key: a field would carry over the totals of earlier keys.
            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
            }
            v.set((int) sum);
            context.write(key, v);
        }
    }
}

The results are as follows:

The data before key renaming was applied to remove the skew:

Other approaches:

Do the reduce-stage logic in the mapper, so that data is already aggregated before it is partitioned.
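As a rough illustration of map-side aggregation, the sketch below collapses the words of one input split into per-key counts before anything is sent on. It is plain Java with a hypothetical class `MapSideAggregationDemo`, not Hadoop code. In a real job, one way to get a similar effect is to register a combiner with `job.setCombinerClass(...)`, or to accumulate counts in a map inside the mapper and write them out in `cleanup`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical demo class; it simulates one mapper's pre-aggregation in memory.
final class MapSideAggregationDemo {

    // Sums the occurrences of each word within a single split.
    // A TreeMap is used only to make the printed order deterministic.
    static Map<String, Integer> preAggregate(List<String> wordsInSplit) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : wordsInSplit) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        // Made-up split in which "a" is a hot key.
        List<String> split = Arrays.asList("a", "a", "a", "a", "a", "a", "a", "a", "b", "c");
        Map<String, Integer> partial = preAggregate(split);
        // Prints: 10 records in, 3 records out: {a=8, b=1, c=1}
        System.out.println(split.size() + " records in, " + partial.size() + " records out: " + partial);
    }
}
```

The hot key still goes to a single reducer, but that reducer now receives one record per mapper for it instead of one record per occurrence. This only works for operations such as counting or summing, where partial results can be merged.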

Override the partitioning method, that is, write a custom partitioner.
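One possible shape for such a partitioning rule is sketched below in plain Java. It assumes the hot key is known in advance, reserves partition 0 for it, and hashes all other keys over the remaining partitions. The class `HotKeyPartitionDemo` and the choice of rule are illustrative assumptions. In Hadoop the same logic would go into a subclass of `org.apache.hadoop.mapreduce.Partitioner`, overriding `getPartition(key, value, numPartitions)`, and be registered with `job.setPartitionerClass(...)`.

```java
// Hypothetical demo class; the rule mirrors what a custom Partitioner could do.
final class HotKeyPartitionDemo {

    // Sends the known hot key to partition 0 and spreads every other key
    // over partitions 1..numPartitions-1 by hash.
    static int getPartition(String key, String hotKey, int numPartitions) {
        if (numPartitions <= 1) {
            return 0; // only one partition available, nothing to separate
        }
        if (key.equals(hotKey)) {
            return 0;
        }
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }

    public static void main(String[] args) {
        String hotKey = "a"; // assumption: found beforehand, e.g. by sampling the input
        for (String key : new String[] {"a", "b", "c"}) {
            System.out.println(key + " -> partition " + getPartition(key, hotKey, 3));
        }
    }
}
```

A partitioner cannot split one key across reducers without breaking the guarantee that a reducer sees all values of a key, so this only keeps other keys from piling onto the hot key's reducer. If the hot key alone is too large for one reducer, the key renaming shown earlier is still needed.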
