0090 - MapReduce custom partitioning

1. Requirement

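When map output is sent to the reducers it is partitioned: each record is assigned to one reduce task. By default all records land in a single partition, but business requirements sometimes call for splitting the output by a custom rule.
Note: in the default logic shown in the formula that follows, (key.hashCode() & Integer.MAX_VALUE) clears the sign bit of the key's hash, which is equivalent to taking the hash modulo Integer.MAX_VALUE + 1 and guarantees a non-negative value. That value is then taken modulo the number of reduce tasks. Because the default number of reduce tasks is 1, the remainder is always 0, so by default every record goes to the same partition.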

(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
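
The arithmetic in the formula above can be tried outside Hadoop. The following is a minimal sketch, not Hadoop's own class: DefaultPartitionSketch and partitionFor are hypothetical names, and plain Java objects stand in for Writable keys.

```java
import java.util.Arrays;
import java.util.List;

class DefaultPartitionSketch {
    // Same arithmetic as the formula: clear the sign bit, then take the remainder.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("13512345678", "13612345678", "13712345678");
        for (String key : keys) {
            // With one reduce task the remainder is always 0.
            System.out.println(key + " -> 1 reducer: " + partitionFor(key, 1)
                    + ", 5 reducers: " + partitionFor(key, 5));
        }
    }
}
```

With one reduce task every key maps to partition 0. With five reduce tasks the result depends only on the key's hash, which is why a custom Partitioner is needed to route records by a business rule such as a phone number prefix.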

2. Implementation steps

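Two steps: 1. write a custom Partitioner implementation; 2. register that implementation in the job class and set the number of reduce tasks.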

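2.1 Entity class
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import lombok.Data;
import org.apache.hadoop.io.WritableComparable;

// Lombok's @Data generates the getters (for example getSumFlow()) used in the other classes
@Data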
public class FlowBean implements WritableComparable<FlowBean> {
    private String number;
    private long upFlow;
    private long downFlow;
    private long sumFlow;

    public FlowBean(String number, long upFlow, long downFlow) {
        this.number = number;
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    public FlowBean() {
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(number);
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.number = in.readUTF();
        this.upFlow = in.readLong();
        this.downFlow = in.readLong();
        this.sumFlow = in.readLong();
    }

    @Override
    public String toString() {
        return number + " " + upFlow + " " + downFlow + " " + sumFlow;
    }

    @Override
    public int compareTo(FlowBean o) {
        // Descending order: the bean with the larger total flow sorts first
        return Long.compare(o.getSumFlow(), this.sumFlow);
    }
}

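Note: in this example the mapper emits FlowBean as its output key, and the framework sorts the keys on the way from Mapper to Reducer, so the custom entity class must implement the WritableComparable interface.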
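
The ordering produced by compareTo can be checked without a cluster. The sketch below is an illustration under assumptions: SortOrderSketch and its nested Flow class are hypothetical stand-ins for FlowBean that keep only the fields needed for sorting and drop the Writable methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SortOrderSketch {
    // Hypothetical stand-in for FlowBean: number plus the total used for ordering.
    static final class Flow implements Comparable<Flow> {
        final String number;
        final long sumFlow;

        Flow(String number, long upFlow, long downFlow) {
            this.number = number;
            this.sumFlow = upFlow + downFlow;
        }

        @Override
        public int compareTo(Flow o) {
            // Larger total first, matching the descending order of FlowBean.compareTo
            return Long.compare(o.sumFlow, this.sumFlow);
        }
    }

    // Returns the numbers in the order the comparison sorts them.
    static List<String> sortedNumbers(List<Flow> flows) {
        List<Flow> copy = new ArrayList<>(flows);
        Collections.sort(copy);
        List<String> numbers = new ArrayList<>();
        for (Flow flow : copy) {
            numbers.add(flow.number);
        }
        return numbers;
    }

    public static void main(String[] args) {
        List<Flow> flows = Arrays.asList(
                new Flow("13512345678", 1, 2),
                new Flow("13612345678", 10, 20),
                new Flow("13712345678", 5, 5));
        System.out.println(sortedNumbers(flows));
    }
}
```

Within each partition the reducer receives its keys in this same descending order of total flow.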

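2.2 Mapper
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
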
public class GroupFlowMapper extends Mapper<LongWritable, Text, FlowBean, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // Split on runs of whitespace so repeated spaces or tabs do not yield empty fields
        String[] args = line.split("\\s+");
        String number = args[0];
        long upFlow = Long.parseLong(args[1]);
        long downFlow = Long.parseLong(args[2]);
        FlowBean flowBean = new FlowBean(number, upFlow, downFlow);
        context.write(flowBean, NullWritable.get());
    }
}
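
2.3 Custom Partitioner
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;
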
public class NumPartitioner extends Partitioner<FlowBean, NullWritable> {
    private static Map<String, Integer> map;

    static {
        map = new HashMap<>();
        // Partition numbers must start at 0
        map.put("135", 0);
        map.put("136", 1);
        map.put("137", 2);
        map.put("138", 3);
    }

    @Override
    public int getPartition(FlowBean flowBean, NullWritable nullWritable, int numPartitions) {
        Integer partitionNum = map.get(flowBean.getNumber().substring(0, 3));
        return partitionNum == null ? 4 : partitionNum;
    }
}

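Note: partition numbers must start at 0; every value returned by getPartition has to lie in the range 0 to numReduceTasks - 1.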
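
The routing rule of NumPartitioner can be exercised without Hadoop. The sketch below copies the prefix table from the partitioner above. PrefixPartitionSketch, partitionFor, and FALLBACK_PARTITION are hypothetical names, and the guard for numbers shorter than three characters is an added assumption that is not part of the original class.

```java
import java.util.HashMap;
import java.util.Map;

class PrefixPartitionSketch {
    // Same prefix table as NumPartitioner: partition numbers start at 0.
    private static final Map<String, Integer> PREFIX_TO_PARTITION = new HashMap<>();
    // Partition for every prefix that is not in the table.
    static final int FALLBACK_PARTITION = 4;

    static {
        PREFIX_TO_PARTITION.put("135", 0);
        PREFIX_TO_PARTITION.put("136", 1);
        PREFIX_TO_PARTITION.put("137", 2);
        PREFIX_TO_PARTITION.put("138", 3);
    }

    static int partitionFor(String number) {
        // Assumption: malformed (null or too short) numbers go to the fallback
        // partition instead of making substring() throw.
        if (number == null || number.length() < 3) {
            return FALLBACK_PARTITION;
        }
        Integer partition = PREFIX_TO_PARTITION.get(number.substring(0, 3));
        return partition == null ? FALLBACK_PARTITION : partition;
    }

    public static void main(String[] args) {
        String[] samples = {"13512345678", "13812345678", "18612345678", "12"};
        for (String sample : samples) {
            System.out.println(sample + " -> partition " + partitionFor(sample));
        }
    }
}
```

The table yields five distinct partition numbers (0 to 4), which is why the job in this example sets five reduce tasks.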

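2.4 Reducer
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;
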
public class GroupFlowReducer extends Reducer<FlowBean, NullWritable, FlowBean, NullWritable> {
    @Override
    protected void reduce(FlowBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        for (NullWritable value : values) {
            context.write(key, value);
        }
    }
}
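
2.5 Running the job
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
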
public class GroupFlowRunner extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // Use the configuration supplied by ToolRunner so that generic command-line options take effect
        Job job = Job.getInstance(getConf());
        job.setJarByClass(GroupFlowRunner.class);

        job.setMapperClass(GroupFlowMapper.class);
        job.setReducerClass(GroupFlowReducer.class);

        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setOutputKeyClass(FlowBean.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Register the custom Partitioner
        job.setPartitionerClass(NumPartitioner.class);
        // The number of reduce tasks may be 1, or any value greater than or equal to the number of partitions
        job.setNumReduceTasks(5);

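        // Exit code convention: 0 means success, non-zero means failure
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // Propagate the job result to the shell as the process exit code
        System.exit(ToolRunner.run(new Configuration(), new GroupFlowRunner(), args));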
    }
}
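
2.6 Summary
  • In a custom Partitioner, partition numbers must start at 0.
  • Register the custom Partitioner on the job.
  • Set the number of reduce tasks on the job: it may be equal to or greater than the number of partitions, and 1 also works.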
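
The rule in the last bullet can be written as a small check. This is a sketch rather than a Hadoop API: ReducerCountSketch and isSafeReducerCount are hypothetical names, and it assumes the partitioner may return any value from 0 to numPartitions - 1. As far as the framework's behavior goes, a single reduce task means the custom partitioner is not consulted at all, while a count between 2 and numPartitions - 1 makes the map task fail with an "Illegal partition" error as soon as a record maps to a missing partition.

```java
class ReducerCountSketch {
    // True when every partition number in [0, numPartitions) has a reduce task,
    // or when there is only one reduce task and partitioning is effectively skipped.
    static boolean isSafeReducerCount(int numReduceTasks, int numPartitions) {
        if (numReduceTasks < 1 || numPartitions < 1) {
            // Map-only jobs (0 reduce tasks) are outside the scope of this sketch.
            throw new IllegalArgumentException("both counts must be at least 1");
        }
        return numReduceTasks == 1 || numReduceTasks >= numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 5; // NumPartitioner returns values 0 to 4
        for (int reducers = 1; reducers <= 6; reducers++) {
            System.out.println(reducers + " reduce task(s): "
                    + (isSafeReducerCount(reducers, numPartitions) ? "ok" : "may fail"));
        }
    }
}
```

Setting more reduce tasks than partitions is safe, but the extra tasks receive no records and produce empty output files.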