1. Requirement
When the mapper sends its results to the Reducer, the records are partitioned. By default every record lands in the same partition; sometimes the data needs to be split into partitions according to different business requirements.
Note: in the default partitioning logic, key.hashCode() & Integer.MAX_VALUE clears the sign bit so the hash is non-negative, and the result is then taken modulo the number of reduce tasks (1 by default, so the remainder is always 0 and everything goes to one partition):
(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
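As a standalone illustration of this default formula, the hypothetical `DefaultPartitionDemo` class below (not part of the job) shows why every key lands in partition 0 when there is a single reduce task: anything modulo 1 is 0.

```java
public class DefaultPartitionDemo {
    // Same formula Hadoop's default HashPartitioner uses:
    // mask off the sign bit, then take the remainder by the task count
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With the default single reduce task, every key maps to partition 0
        System.out.println(partition("13512345678", 1)); // 0
        System.out.println(partition("13612345678", 1)); // 0
    }
}
```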
2. Implementation steps
Two steps: 1. write a custom Partitioner class; 2. set the custom Partitioner and the number of reduce tasks in the job class.
2.1 Entity class
import lombok.Data;
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

@Data
public class FlowBean implements WritableComparable<FlowBean> {
    private String number;
    private long upFlow;
    private long downFlow;
    private long sumFlow;

    public FlowBean(String number, long upFlow, long downFlow) {
        this.number = number;
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    // A no-arg constructor is required so Hadoop can instantiate the bean during deserialization
    public FlowBean() {
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(number);
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields must be read in the same order they were written
        this.number = in.readUTF();
        this.upFlow = in.readLong();
        this.downFlow = in.readLong();
        this.sumFlow = in.readLong();
    }

    @Override
    public String toString() {
        return number + " " + upFlow + " " + downFlow + " " + sumFlow;
    }

    @Override
    public int compareTo(FlowBean o) {
        // Sort by sumFlow in descending order
        return Long.compare(o.getSumFlow(), this.sumFlow);
    }
}
Note: in this example the mapper's output key is the FlowBean entity class. Because the framework sorts keys on the way from the Mapper to the Reducer, a custom entity class used as a key must implement the WritableComparable interface.
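The descending order that FlowBean.compareTo produces can be sketched with plain longs; `SortDemo` below is a hypothetical standalone class, not part of the job.

```java
import java.util.Arrays;

public class SortDemo {
    public static void main(String[] args) {
        Long[] sums = {100L, 300L, 200L};
        // Same logic as FlowBean.compareTo: the larger sumFlow sorts first
        Arrays.sort(sums, (a, b) -> Long.compare(b, a));
        System.out.println(Arrays.toString(sums)); // [300, 200, 100]
    }
}
```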
2.2 Mapper
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class GroupFlowMapper extends Mapper<LongWritable, Text, FlowBean, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // "\\s+" tolerates runs of whitespace between fields
        String[] args = line.split("\\s+");
        String number = args[0];
        long upFlow = Long.parseLong(args[1]);
        long downFlow = Long.parseLong(args[2]);
        FlowBean flowBean = new FlowBean(number, upFlow, downFlow);
        context.write(flowBean, NullWritable.get());
    }
}
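The mapper's parsing step can be tried outside Hadoop; `SplitDemo` below is a hypothetical standalone sketch that applies the same split-and-parse logic to one input line.

```java
public class SplitDemo {
    public static void main(String[] args) {
        // One line of input: phone number, upstream flow, downstream flow
        String line = "13512345678 1024 2048";
        String[] parts = line.split("\\s+");
        long up = Long.parseLong(parts[1]);
        long down = Long.parseLong(parts[2]);
        System.out.println(parts[0] + " sum=" + (up + down)); // 13512345678 sum=3072
    }
}
```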
2.3 Custom Partitioner
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

import java.util.HashMap;
import java.util.Map;

public class NumPartitioner extends Partitioner<FlowBean, NullWritable> {
    private static Map<String, Integer> map;

    static {
        map = new HashMap<>();
        // Partition numbers must start from 0
        map.put("135", 0);
        map.put("136", 1);
        map.put("137", 2);
        map.put("138", 3);
    }

    @Override
    public int getPartition(FlowBean flowBean, NullWritable nullWritable, int numPartitions) {
        // Route by the first three digits of the phone number; anything unmatched goes to partition 4
        Integer partitionNum = map.get(flowBean.getNumber().substring(0, 3));
        return partitionNum == null ? 4 : partitionNum;
    }
}
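The routing logic can be exercised without a cluster; `PartitionDemo` below is a hypothetical standalone copy of the same prefix lookup.

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionDemo {
    static final Map<String, Integer> MAP = new HashMap<>();

    static {
        MAP.put("135", 0);
        MAP.put("136", 1);
        MAP.put("137", 2);
        MAP.put("138", 3);
    }

    // Same lookup as NumPartitioner.getPartition: prefix match, default 4
    static int partitionFor(String number) {
        Integer p = MAP.get(number.substring(0, 3));
        return p == null ? 4 : p;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("13600001111")); // 1
        System.out.println(partitionFor("15900001111")); // 4 (unmatched prefix)
    }
}
```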
Note: partition numbers must start from 0.
2.4 Reducer
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class GroupFlowReducer extends Reducer<FlowBean, NullWritable, FlowBean, NullWritable> {
    @Override
    protected void reduce(FlowBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // Keys arrive sorted; write each record straight through
        for (NullWritable value : values) {
            context.write(key, value);
        }
    }
}
2.5 Running the job
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class GroupFlowRunner extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // ToolRunner injects the Configuration; retrieve it via getConf()
        Configuration conf = getConf();
        Job job = Job.getInstance(conf);
        job.setJarByClass(GroupFlowRunner.class);
        job.setMapperClass(GroupFlowMapper.class);
        job.setReducerClass(GroupFlowReducer.class);
        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(FlowBean.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Set the custom Partitioner
        job.setPartitionerClass(NumPartitioner.class);
        // Must be at least the number of partitions (5 here); 1 also works, because a single reducer bypasses the partitioner
        job.setNumReduceTasks(5);
        // Exit code 0 signals success
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Configuration(), new GroupFlowRunner(), args);
        System.exit(exitCode);
    }
}
2.6 Summary
- In a custom Partitioner, partition numbers must start from 0.
- Register the custom Partitioner on the job with setPartitionerClass.
- Set the number of reduce tasks on the job: any value greater than or equal to the number of partitions works, and 1 also works (a single reducer bypasses the partitioner). A value in between (2 to 4 here) fails at runtime with an "Illegal partition" error.