1. Requirement
When the mapper sends its results to the Reducer, the records are partitioned. By default every record lands in the same partition; sometimes the data needs to be split into partitions according to different business requirements.
Note: in the default partitioning logic, key.hashCode() & Integer.MAX_VALUE clears the sign bit so the hash is non-negative, and the result is then taken modulo the number of reduce tasks (1 by default, so the remainder is always 0 and everything goes to one partition):
(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
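As a standalone illustration of this default formula, the hypothetical `DefaultPartitionDemo` class below (not part of the job) shows why every key lands in partition 0 when there is a single reduce task: anything modulo 1 is 0.

```java
public class DefaultPartitionDemo {
    // Same formula Hadoop's default HashPartitioner uses:
    // mask off the sign bit, then take the remainder by the task count
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With the default single reduce task, every key maps to partition 0
        System.out.println(partition("13512345678", 1)); // 0
        System.out.println(partition("13612345678", 1)); // 0
    }
}
```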
2. Implementation steps
Two steps: 1. write a custom Partitioner class; 2. set the custom Partitioner and the number of reduce tasks in the job class.
2.1 Entity class
import lombok.Data;
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

@Data
public class FlowBean implements WritableComparable<FlowBean> {
    private String number;
    private long upFlow;
    private long downFlow;
    private long sumFlow;

    public FlowBean(String number, long upFlow, long downFlow) {
        this.number = number;
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    // A no-arg constructor is required so Hadoop can instantiate the bean during deserialization
    public FlowBean() {
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(number);
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields must be read in the same order they were written
        this.number = in.readUTF();
        this.upFlow = in.readLong();
        this.downFlow = in.readLong();
        this.sumFlow = in.readLong();
    }

    @Override
    public String toString() {
        return number + " " + upFlow + " " + downFlow + " " + sumFlow;
    }

    @Override
    public int compareTo(FlowBean o) {
        // Sort by sumFlow in descending order
        return Long.compare(o.getSumFlow(), this.sumFlow);
    }
}
Note: in this example the mapper's output key is the FlowBean entity class. Because the framework sorts keys on the way from the Mapper to the Reducer, a custom entity class used as a key must implement the WritableComparable interface.
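The descending order that FlowBean.compareTo produces can be sketched with plain longs; `SortDemo` below is a hypothetical standalone class, not part of the job.

```java
import java.util.Arrays;

public class SortDemo {
    public static void main(String[] args) {
        Long[] sums = {100L, 300L, 200L};
        // Same logic as FlowBean.compareTo: the larger sumFlow sorts first
        Arrays.sort(sums, (a, b) -> Long.compare(b, a));
        System.out.println(Arrays.toString(sums)); // [300, 200, 100]
    }
}
```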
2.2 Mapper
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class GroupFlowMapper extends Mapper<LongWritable, Text, FlowBean, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // "\\s+" tolerates runs of whitespace between fields
        String[] args = line.split("\\s+");
        String number = args[0];
        long upFlow = Long.parseLong(args[1]);
        long downFlow = Long.parseLong(args[2]);
        FlowBean flowBean = new FlowBean(number, upFlow, downFlow);
        context.write(flowBean, NullWritable.get());
    }
}
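The mapper's parsing step can be tried outside Hadoop; `SplitDemo` below is a hypothetical standalone sketch that applies the same split-and-parse logic to one input line.

```java
public class SplitDemo {
    public static void main(String[] args) {
        // One line of input: phone number, upstream flow, downstream flow
        String line = "13512345678 1024 2048";
        String[] parts = line.split("\\s+");
        long up = Long.parseLong(parts[1]);
        long down = Long.parseLong(parts[2]);
        System.out.println(parts[0] + " sum=" + (up + down)); // 13512345678 sum=3072
    }
}
```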
2.3 Custom Partitioner
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

import java.util.HashMap;
import java.util.Map;

public class NumPartitioner extends Partitioner<FlowBean, NullWritable> {
    private static Map<String, Integer> map;

    static {
        map = new HashMap<>();
        // Partition numbers must start from 0
        map.put("135", 0);
        map.put("136", 1);
        map.put("137", 2);
        map.put("138", 3);
    }

    @Override
    public int getPartition(FlowBean flowBean, NullWritable nullWritable, int numPartitions) {
        // Route by the first three digits of the phone number; anything unmatched goes to partition 4
        Integer partitionNum = map.get(flowBean.getNumber().substring(0, 3));
        return partitionNum == null ? 4 : partitionNum;
    }
}
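The routing logic can be exercised without a cluster; `PartitionDemo` below is a hypothetical standalone copy of the same prefix lookup.

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionDemo {
    static final Map<String, Integer> MAP = new HashMap<>();

    static {
        MAP.put("135", 0);
        MAP.put("136", 1);
        MAP.put("137", 2);
        MAP.put("138", 3);
    }

    // Same lookup as NumPartitioner.getPartition: prefix match, default 4
    static int partitionFor(String number) {
        Integer p = MAP.get(number.substring(0, 3));
        return p == null ? 4 : p;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("13600001111")); // 1
        System.out.println(partitionFor("15900001111")); // 4 (unmatched prefix)
    }
}
```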
Note: partition numbers must start from 0.
2.4 Reducer
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class GroupFlowReducer extends Reducer<FlowBean, NullWritable, FlowBean, NullWritable> {
    @Override
    protected void reduce(FlowBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // Keys arrive sorted; write each record straight through
        for (NullWritable value : values) {
            context.write(key, value);
        }
    }
}
2.5 Running the job
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class GroupFlowRunner extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // ToolRunner injects the Configuration; retrieve it via getConf()
        Configuration conf = getConf();
        Job job = Job.getInstance(conf);
        job.setJarByClass(GroupFlowRunner.class);
        job.setMapperClass(GroupFlowMapper.class);
        job.setReducerClass(GroupFlowReducer.class);
        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(FlowBean.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Set the custom Partitioner
        job.setPartitionerClass(NumPartitioner.class);
        // Must be at least the number of partitions (5 here); 1 also works, because a single reducer bypasses the partitioner
        job.setNumReduceTasks(5);
        // Exit code 0 signals success
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new GroupFlowRunner(), args);
        System.exit(exitCode);
    }
}
2.6 Summary
- In a custom Partitioner, partition numbers must start from 0.
- Register the custom Partitioner on the job with setPartitionerClass.
- Set the number of reduce tasks on the job: any value greater than or equal to the number of partitions works, and 1 also works (a single reducer bypasses the partitioner). A value in between (2 to 4 here) fails at runtime with an "Illegal partition" error.