Hadoop Series (4): Passing a Custom Bean in MapReduce, Sorting the Results, and Analyzing Mapper and Reducer Parallelism

Case requirement: compute the total upstream and downstream traffic for each phone number and partition the results by phone-number range.
The test data is shown in the figure below:

[figure: sample test data]

Analysis:
Normally the mapper's and reducer's input and output types are built-in classes such as LongWritable and Text. If we want to pass a custom bean instead, it has to follow Hadoop's serialization contract. Looking at the LongWritable source code, we can see that it implements the WritableComparable<LongWritable> interface:

/** A WritableComparable for longs. */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class LongWritable implements WritableComparable<LongWritable> {...}

Likewise, for a custom bean to be passed around by the Hadoop MapReduce framework, it needs to implement the same interface. WritableComparable is simply the combination of the Writable and Comparable interfaces, which make the bean serializable and comparable respectively:

@InterfaceAudience.Public
@InterfaceStability.Stable
public interface WritableComparable<T> extends Writable, Comparable<T> {
}

Unlike the JDK's default serialization, Hadoop does not serialize the bean's inheritance hierarchy or implemented interfaces; only the bean's own fields are written, which saves network bandwidth during transfer.
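
As a quick illustration of the difference (this demo is not from the original article), writing a LongWritable produces exactly the 8 bytes of the long, while JDK serialization of a Long also writes class metadata and ends up roughly an order of magnitude larger:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.ObjectOutputStream;

import org.apache.hadoop.io.LongWritable;

public class SerializationSizeDemo {

    public static void main(String[] args) throws Exception {
        // Hadoop Writable: only the raw field bytes are written (8 bytes for one long)
        ByteArrayOutputStream writableBytes = new ByteArrayOutputStream();
        new LongWritable(42L).write(new DataOutputStream(writableBytes));

        // JDK serialization: class metadata travels along with the value
        ByteArrayOutputStream jdkBytes = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(jdkBytes);
        oos.writeObject(Long.valueOf(42L));
        oos.close();

        System.out.println("Writable: " + writableBytes.size() + " bytes");       // 8
        System.out.println("JDK serialized: " + jdkBytes.size() + " bytes");      // dozens of bytes
    }
}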

Next, let's implement a bean that follows Hadoop's serialization contract ourselves.

FlowBean:
(The setters and getters for the fields are omitted.)

public class FlowBean implements Writable {
    
    private String phone;
    private long upStream;
    private long downStream;
    private long sumStream;
    
    /**
     * The framework instantiates the bean via reflection during
     * deserialization, so a no-arg constructor is required.
     */
    public FlowBean() {}
    
    public FlowBean(String phone, long upStream, long downStream) {
        super();
        this.phone = phone;
        this.upStream = upStream;
        this.downStream = downStream;
        this.sumStream = upStream + downStream;
    }

    /**
     * Deserialize the object's data from the input stream.
     * Fields must be read in exactly the same order in which they were written.
     */
    public void readFields(DataInput input) throws IOException {
        phone = input.readUTF();
        upStream = input.readLong();
        downStream = input.readLong();
        sumStream = input.readLong();
    }

    /**
     * Serialize the object's fields to the output stream.
     */
    public void write(DataOutput output) throws IOException {
        output.writeUTF(phone);
        output.writeLong(upStream);
        output.writeLong(downStream);
        output.writeLong(sumStream);
    }

    /**
     * Output format of the reduce result.
     */
    @Override
    public String toString() {
        return "" + upStream + "\t" + downStream + "\t" + sumStream;
    }

}
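
For reference, the getters that the reducer below (and the later sort job) rely on are just the plain field accessors, omitted from the class above for brevity:

    public String getPhone() { return phone; }

    public long getUpStream() { return upStream; }

    public long getDownStream() { return downStream; }

    public long getSumStream() { return sumStream; }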

FlowMapper:

public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
    
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        
        // each input line is tab-separated; the phone number is the 2nd field,
        // the upstream and downstream traffic values are the 8th and 9th fields
        String line = value.toString();
        String[] fields = StringUtils.split(line, "\t");
        
        String phone = fields[1];
        long upStream = Long.parseLong(fields[7]);
        long downStream = Long.parseLong(fields[8]);
        
        // emit <phone, FlowBean> so that all records of one phone reach the same reducer
        context.write(new Text(phone), new FlowBean(phone, upStream, downStream));
        
    }

}

FlowReducer:

public class FlowReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
    
    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context)
            throws IOException, InterruptedException {
        
        long upStreamCounter = 0;
        long downStreamCounter = 0;
        
        // accumulate the traffic of all records belonging to this phone
        for (FlowBean bean: values) {
            upStreamCounter += bean.getUpStream();
            downStreamCounter += bean.getDownStream();
        }
        
        context.write(key, new FlowBean(key.toString(), upStreamCounter, downStreamCounter));
        
    }

}

FlowRunner:
The standard way to implement a runner is to extend the Configured class and implement the Tool interface; ToolRunner then parses generic command-line options (such as -D settings) and injects the resulting Configuration into the job.

public class FlowRunner extends Configured implements Tool{

    public int run(String[] args) throws Exception {
        
        // use the configuration injected by ToolRunner so that -D options take effect
        Configuration conf = getConf();
        Job job = Job.getInstance(conf);
        
        job.setJarByClass(FlowRunner.class);
        
        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);
        
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        
        return job.waitForCompletion(true)?0:1;
    }
    
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new FlowRunner(), args);
        System.exit(res);
    }

}
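
With this runner in place the job can be packaged and submitted from the command line, for example (the jar name and HDFS paths here are illustrative): hadoop jar flow.jar FlowRunner /flow/input /flow/output. Because submission goes through ToolRunner, generic options such as -D mapreduce.job.reduces=2 can be passed on the command line without changing the code.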

To sort the output, for example by total traffic in descending order, FlowBean can implement the WritableComparable interface directly:

public class FlowBean implements WritableComparable<FlowBean> {...}

and override the compareTo method:

    public int compareTo(FlowBean o) {
        // descending order by total traffic; break ties on the phone number so that
        // beans with equal totals are still treated as distinct keys
        if (sumStream != o.sumStream) return sumStream > o.sumStream ? -1 : 1;
        return phone.compareTo(o.phone);
    }

The mapper, reducer, and driver for the sort job then look like this:

public class SortMR {
    
    public static class SortMapper extends Mapper<LongWritable, Text, FlowBean, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            
            // the input here is the output of the previous job:
            // phone \t upStream \t downStream \t sumStream
            String line = value.toString();
            String[] fields = StringUtils.split(line, "\t");
            String phone = fields[0];
            long upStream = Long.parseLong(fields[1]);
            long downStream = Long.parseLong(fields[2]);
            
            // use the bean itself as the key so the framework sorts records by compareTo
            context.write(new FlowBean(phone, upStream, downStream), NullWritable.get());
        }
    }
    
    public static class SortReducer extends Reducer<FlowBean, NullWritable, Text, FlowBean> {
        @Override
        protected void reduce(FlowBean key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            
            String phone = key.getPhone();
            context.write(new Text(phone), key);
        }
    }
    
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        
        job.setJarByClass(SortMR.class);
        
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);
        
        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        
        System.exit(job.waitForCompletion(true)?0:1);
    }
}

Submitting the job to YARN gives the following result:

[figure: sorted output of the job]

To partition the output, i.e. to write the traffic statistics of different phone-number ranges to separate files, we need to control how map output is partitioned across reducers and set the number of concurrent reduce tasks accordingly.
First, define a custom Partitioner class as follows:

public class AreaPartitioner<KEY, VALUE> extends Partitioner<KEY, VALUE>{

    private static HashMap<String,Integer> areaMap = new HashMap<String, Integer>();
    
    static{
        areaMap.put("135", 0);
        areaMap.put("136", 1);
        areaMap.put("137", 2);
        areaMap.put("138", 3);
        areaMap.put("139", 4);
    }
    
    @Override
    public int getPartition(KEY key, VALUE value, int numPartitions) {
        // extract the phone-number prefix from the key, look it up in the area
        // dictionary, and return a different partition number for each area;
        // unknown prefixes fall into partition 5
        Integer areaCode = areaMap.get(key.toString().substring(0, 3));
        return areaCode == null ? 5 : areaCode;
    }

}
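
A quick sanity check of the mapping, e.g. in a small main method (the phone numbers below are made up), can simply call getPartition directly:

    AreaPartitioner<Text, FlowBean> partitioner = new AreaPartitioner<Text, FlowBean>();
    // prefixes present in the dictionary map to partitions 0-4 ...
    System.out.println(partitioner.getPartition(new Text("13512345678"), null, 6));  // 0
    System.out.println(partitioner.getPartition(new Text("13912345678"), null, 6));  // 4
    // ... and any other prefix falls into the catch-all partition 5
    System.out.println(partitioner.getPartition(new Text("15012345678"), null, 6));  // 5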

Then add the following to the job configuration:

    // register the custom partitioning logic
    job.setPartitionerClass(AreaPartitioner.class);
    
    // set the number of concurrent reduce tasks; it should match the number of partitions.
    // If it is larger, the extra reducers simply produce empty output files and no error occurs;
    // if it is smaller (but greater than 1), the job fails; if it is set to 1, the behavior is
    // the same as the default: a single reducer runs and produces a single output file.
    job.setNumReduceTasks(6);

The job output then looks like this:

[figure: one output file per partition]

If we make four copies of the test data file, as shown below:

[figure: input directory containing the copied data files]

and submit the job to YARN, then inspect the Java processes while the map tasks have started but not yet finished:

[figure: Java process list during the map phase]

We can see that five YarnChild processes are executing map tasks at the same time. Because each small file occupies its own block and therefore gets its own input split, and each split is processed by a separate map task, the more small files there are, the more map task processes are launched, the more resources are consumed, and the lower the overall efficiency.

In fact, the number of concurrent map tasks is determined by the number of input splits: one map task is launched for each split. A split is a logical concept that describes an offset range of data within a file, and its size should be tuned according to the size of the files being processed. The process that carries the map output to the reduce input is called the shuffle.
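
If many small files are the real bottleneck, one common mitigation (not used in the original example; the size below is illustrative) is CombineTextInputFormat, which packs several small files into a single split so that fewer map tasks are launched:

    // use org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat in the driver
    job.setInputFormatClass(CombineTextInputFormat.class);
    // combine small files into splits of at most ~4 MB each
    CombineTextInputFormat.setMaxInputSplitSize(job, 4 * 1024 * 1024);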
