1. What is data skew
Data skew, as the name suggests, is uneven data distribution: the problem of allocating massive amounts of data across a distributed system or cluster. Suppose your mother buys a hundred apples, gives your younger brother eighty and you twenty, and will only buy the next batch once both of you have finished (and you both love apples). This allocation is clearly unreasonable. You and your brother eat at the same pace, so once your apples run out you have to wait, hungry, for him to finish all of his before anyone gets more. Worse, he may get carried away, eat twenty in one day, and end up sick, so now you also wait for him to recover; resentment between you is inevitable. That is the famous "apple skew". Big data work faces the same picture: the data being processed may be at the TB or even PB scale and must be handled by a cluster of machines, so an unbalanced allocation severely hurts the processing efficiency of the entire cluster.
Hence data skew is the problem in which a data-processing job, when distributing work among machines with identical processing resources, assigns them unequal amounts of data, dragging down the overall processing efficiency of the cluster.
2. Why Hadoop MapReduce produces data skew
(Figure: the MapReduce data processing flow)
Data skew arises from how data is distributed, and MapReduce distributes data at three points: input splitting, partitioning, and the reducer-side fetch of partition data. Splits are derived from file count and file size, so they do not skew. Partitioning, however, defaults in Hadoop to computing the partition number as (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, and the subsequent fetch is driven by those partition numbers. So if the keys' hash codes are unevenly distributed, the partition numbers are assigned unevenly, and the data skews when it is merged by partition number.
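For reference, the default partitioner amounts to the following (simplified from Hadoop's org.apache.hadoop.mapreduce.lib.partition.HashPartitioner):

import org.apache.hadoop.mapreduce.Partitioner;

// Simplified from Hadoop's built-in HashPartitioner: records with the same
// key hash always land in the same partition, so a single hot key can flood
// one reducer while the others sit idle.
public class HashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // & Integer.MAX_VALUE clears the sign bit so the result is non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}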
3. Solution
Hadoop's default scheme partitions by the key's hashCode, so data skew is mostly the fault of the key names. We can rename the keys in the mapper, before partition numbers are assigned, by appending a random suffix; as long as the resulting names are evenly distributed, no skew occurs. The drawback is that an extra pass over the data is needed afterwards to restore the original key names. For example, with two reducers, occurrences of hello are randomly rewritten as hello-0 or hello-1; the first job emits partial counts such as hello-0 5 and hello-1 7, and the second job strips the suffix and merges them back into hello 12.
The full code is as follows:
package com.spj.hadoopLean.MapReduiceDemo.skew;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import java.io.IOException;
import java.util.Random;
public class Driver {
    /*
     * Job 1: the salted word count (run this first).
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration(), "skewMR");
        job.setJarByClass(Driver.class);
        job.setMapperClass(SkewMapper.class);
        job.setReducerClass(SkewReduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setNumReduceTasks(2);
        FileInputFormat.setInputPaths(job, new Path("src/main/java/com/spj/hadoopLean/MapReduiceDemo/datas/skew/input"));
        FileOutputFormat.setOutputPath(job, new Path("src/main/java/com/spj/hadoopLean/MapReduiceDemo/datas/skew/outputSkew"));
        job.waitForCompletion(true);
    }*/
    // Job 2: strips the random suffix from job 1's output and merges the partial counts.
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration(), "skew1MR");
        job.setJarByClass(Driver.class);
        job.setMapperClass(Skew1Mapper.class);
        job.setReducerClass(Skew1Reduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // job.setOutputFormatClass(SequenceFileOutputFormat.class);
        // job.setNumReduceTasks(2);
        // Job 1's output directory serves as job 2's input.
        String filePath = "F:/j2eeProjecet/hadoopLean/src/main/java/com/spj/hadoopLean/MapReduiceDemo/datas/skew/input/skewResult";
        FileInputFormat.setInputPaths(job, new Path(filePath));
        FileOutputFormat.setOutputPath(job, new Path("src/main/java/com/spj/hadoopLean/MapReduiceDemo/datas/skew/output1"));
        job.waitForCompletion(true);
    }
    static class SkewMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        int tasks = 0;
        Random r = new Random();
        IntWritable v = new IntWritable(1);

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            tasks = context.getNumReduceTasks(); // number of salt buckets
        }

        /**
         * Salts each key with a random suffix to work around hash-induced skew.
         * Hadoop assigns partition numbers automatically, essentially as
         * (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, so evenly
         * distributed key names yield evenly loaded reducers.
         */
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] split = value.toString().split("\\s+");
            for (String s : split) {
                // Append a random suffix in [0, numReduceTasks) so salted keys hash evenly.
                s = s + "-" + r.nextInt(tasks);
                value.set(s);
                context.write(value, v);
            }
        }
    }
    static class SkewReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int count = 0; // local, so counts do not accumulate across keys
            for (IntWritable value : values) {
                count++;
            }
            context.write(key, new IntWritable(count));
        }
    }
    static class Skew1Mapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private final Text word = new Text();
        private final LongWritable count = new LongWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Lines from job 1 look like "hello-1<TAB>5": word, random suffix, partial count.
            String[] split = value.toString().split("-");
            if (split.length != 2) {
                return; // assumes the original words contain no '-'
            }
            try {
                word.set(split[0]); // restore the original key by dropping the suffix
                count.set(Long.parseLong(split[1].split("\\s+")[1])); // the partial count after the tab
                context.write(word, count);
            } catch (NumberFormatException e) {
                e.printStackTrace(); // skip lines whose count cannot be parsed
            }
        }
    }
    static class Skew1Reduce extends Reducer<Text, LongWritable, Text, IntWritable> {
        private final IntWritable v = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0; // local, so sums do not accumulate across keys
            for (LongWritable value : values) {
                sum += value.get();
            }
            v.set(sum);
            context.write(key, v);
        }
    }
}
(Figures omitted: the results of the salted run, and the data from the original run that did not use key renaming to eliminate the skew.)
Other approaches (both sketched below):
Handle the reduce-stage logic in the mapper (e.g. with a Combiner)
Override the partitioning method (a custom Partitioner)
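As an illustration of the second idea, here is a minimal custom Partitioner. This is a hypothetical sketch rather than code from this project: the class name SkewAwarePartitioner, the hot key "hello", and the routing scheme are all assumptions made for the example.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical sketch: pin a known hot key to its own reducer and spread
// all remaining keys over the other partitions (assumes numReduceTasks >= 2).
public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Assumption for illustration: "hello" is a known hot key.
        if ("hello".equals(key.toString())) {
            return 0; // reserve partition 0 for the hot key
        }
        // Everything else shares partitions 1 .. numReduceTasks-1.
        return (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1) + 1;
    }
}

It would be registered with job.setPartitionerClass(SkewAwarePartitioner.class). The first idea is what Hadoop's Combiner mechanism provides: job.setCombinerClass(...) runs reduce-style aggregation on each mapper's output before the shuffle, so a hot key ships one partial result per map task instead of one record per occurrence (note that a combiner must sum values, so the record-counting SkewReduce above cannot be reused as-is).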