MapReduce: Top 10

路人张的鱼生

Category: MapReduce. Tags: MapReduce, big data

Article link: https://blog.csdn.net/zhangdy12307/article/details/89412892

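Pattern description

As the name suggests, the Top 10 pattern outputs exactly the top 10 records under a given ranking rule, regardless of how large the input is. In an ordinary filtering pattern, by contrast, the number of output records is determined by the input data.

Intent

Regardless of the size of the data set, retrieve a relatively small set of top-K records according to the data set's ranking scheme.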

Applicability

Outlier analysis
Selecting interesting data
Eye-catching dashboards
Weibo trending topics

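Performance analysis

The Top 10 pattern usually performs very well. Note, however, that no matter how many records the job processes, it can use only one reducer. Because the reduce phase runs on a single machine, a very large K makes the job inefficient: every mapper sends up to K records to that one reducer.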
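To make the cost of the single reducer concrete, here is a minimal plain-Java sketch (no Hadoop involved) that simulates the idea: each input split keeps only its local top K, and one final step merges those partial results. The class name SingleReducerCostSketch, the method localTopK, and the three sample splits are assumptions made up for this illustration; the values are distinct on purpose.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

class SingleReducerCostSketch {
    // Keep only the k largest values of one collection (what one mapper would emit).
    static List<Integer> localTopK(List<Integer> values, int k) {
        TreeMap<Integer, Integer> top = new TreeMap<>();
        for (int v : values) {
            top.put(v, v);
            if (top.size() > k) {
                top.remove(top.firstKey()); // evict the current minimum
            }
        }
        return new ArrayList<>(top.values()); // ascending order
    }

    public static void main(String[] args) {
        int k = 3;
        // Three hypothetical input splits
        List<List<Integer>> splits = Arrays.asList(
                Arrays.asList(5, 1, 9, 7, 3),
                Arrays.asList(8, 2, 6, 4, 10),
                Arrays.asList(15, 11, 13, 12, 14));

        // "Map" side: every split contributes at most k records
        List<Integer> reducerInput = new ArrayList<>();
        for (List<Integer> split : splits) {
            reducerInput.addAll(localTopK(split, k));
        }

        // "Reduce" side: one merge over at most splits * k records
        System.out.println("reducer input size = " + reducerInput.size()); // 9
        System.out.println("global top " + k + " = " + localTopK(reducerInput, k)); // [13, 14, 15]
    }
}
```

The reducer's input grows with the number of splits times K, which is why a small K keeps the single reducer cheap and a large K does not.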

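Problem description

Find the ten users with the highest reputation in the user data.

Sample input

The code that creates the data set is shown below. Each line of the generated data set represents one user, with a username and a reputation value.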

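import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class create {
    public static void main(String[] args) throws IOException {
        String path = "input/file.txt";
        File file = new File(path);
        // Create the parent directory if it does not exist yet
        File parent = file.getParentFile();
        if (parent != null && !parent.exists()) {
            parent.mkdirs();
        }
        // Overwrite any existing file so the data set always has exactly 5000 lines
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(file))) {
            for (int i = 0; i < 5000; i++) {
                // Reputation is a random integer in [1000, 1999]
                int reputation = (int) (Math.random() * 1000 + 1000);
                // The mapper and reducer rely on this exact "Reputation = " prefix
                bw.write("Username = " + i + "   Reputation = " + reputation + "\n");
            }
        }
    }
}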

Running the program produces a file with 5,000 records.

[Figure: the generated data set]

Sample output

[Figure: output of the job]

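Mapper phase

The mapper processes all of its input records and stores them in a TreeMap. TreeMap is a Map implementation that keeps its entries sorted by key (here, the reputation value). Whenever the TreeMap holds more than 10 records, its first element (the one with the smallest key) is removed. After all records have been processed, the records remaining in the TreeMap are emitted to the reducer in the cleanup method.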
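The bounded-TreeMap idea can be tried outside Hadoop. The following is a minimal sketch, assuming input lines in the "Username = ...   Reputation = ..." format produced by the generator; the class name MapperTopKSketch and the helpers parseReputation and topK are hypothetical names used only for illustration, and the sample reputations are distinct.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

class MapperTopKSketch {
    private static final String PREFIX = "Reputation = ";

    // Extract the integer that follows "Reputation = " in one input line.
    static int parseReputation(String line) {
        return Integer.parseInt(line.substring(line.indexOf(PREFIX) + PREFIX.length()).trim());
    }

    // Return the k lines with the highest reputation, in ascending order of reputation.
    static List<String> topK(List<String> lines, int k) {
        TreeMap<Integer, String> top = new TreeMap<>();
        for (String line : lines) {
            top.put(parseReputation(line), line);
            if (top.size() > k) {
                top.remove(top.firstKey()); // firstKey() is the smallest reputation seen so far
            }
        }
        return new ArrayList<>(top.values());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Username = 0   Reputation = 1200",
                "Username = 1   Reputation = 1900",
                "Username = 2   Reputation = 1500",
                "Username = 3   Reputation = 1000");
        // Prints the lines for users 2 (1500) and 1 (1900), in that order
        for (String line : topK(lines, 2)) {
            System.out.println(line);
        }
    }
}
```

Because the map never holds more than k + 1 entries, memory use stays constant no matter how many lines are processed.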

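The mapper code is as follows:

public static class TopMapper extends Mapper<Object, Text, NullWritable, Text> {
    // Marker that precedes the reputation value in each input line
    private static final String INDEX_STR = "Reputation = ";
    // Sorted by reputation in ascending order, so firstKey() is the smallest
    private final TreeMap<Integer, Text> topMap = new TreeMap<Integer, Text>();

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // Extract the reputation value that follows "Reputation = "
        String reputation = line.substring(line.indexOf(INDEX_STR) + INDEX_STR.length()).trim();
        // Records with the same reputation share a key, so a later one replaces an earlier one
        topMap.put(Integer.parseInt(reputation), new Text(line));
        // Keep only the 10 largest keys by evicting the smallest
        if (topMap.size() > 10) {
            topMap.remove(topMap.firstKey());
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Emit this mapper's local top 10 once all of its input has been processed
        for (Text t : topMap.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}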

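Reducer phase

In this phase, job.setNumReduceTasks(1) configures the job to use a single reducer, and NullWritable is used as the key, so every record emitted by the mappers arrives in the same reduce call. The reducer iterates over all records and stores them in a TreeMap that is configured to sort its keys from largest to smallest. Whenever the TreeMap holds more than 10 records, its last element (the smallest) is removed. After all values have been processed, the result is written out.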
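The effect of the descending ordering can be seen in isolation with a small plain-Java sketch. It assumes the ordering is obtained with Collections.reverseOrder(), as in the reducer code; the class name DescendingTreeMapSketch, the method topKDescending, and the sample values are made up for illustration.

```java
import java.util.Collections;
import java.util.TreeMap;

class DescendingTreeMapSketch {
    // Keep the k largest reputations in a TreeMap sorted from largest to smallest.
    static TreeMap<Integer, String> topKDescending(int[] reputations, int k) {
        TreeMap<Integer, String> top = new TreeMap<>(Collections.reverseOrder());
        for (int r : reputations) {
            top.put(r, "user with reputation " + r);
            if (top.size() > k) {
                // With a reversed comparator, lastKey() is the smallest reputation
                top.remove(top.lastKey());
            }
        }
        return top;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> top = topKDescending(new int[] {1500, 1999, 1000, 1750}, 3);
        System.out.println(top.keySet()); // [1999, 1750, 1500]
    }
}
```

Iterating over the map then yields the records from highest to lowest reputation, which is the order in which the reducer writes its output.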

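The reducer code is as follows:

public static class TopReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
    // Marker that precedes the reputation value in each record
    private static final String INDEX_STR = "Reputation = ";
    // Sorted by reputation in descending order, so lastKey() is the smallest
    private final TreeMap<Integer, Text> topMap = new TreeMap<Integer, Text>(Collections.reverseOrder());

    @Override
    public void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            String line = value.toString();
            String reputation = line.substring(line.indexOf(INDEX_STR) + INDEX_STR.length()).trim();
            // Hadoop reuses the Text instance across iterations, so store a copy
            topMap.put(Integer.parseInt(reputation), new Text(value));
            // Keep only the 10 largest keys by evicting the smallest
            if (topMap.size() > 10) {
                topMap.remove(topMap.lastKey());
            }
        }
        // Write the global top 10, from highest to lowest reputation
        for (Text t : topMap.values()) {
            context.write(NullWritable.get(), t);
        }
    }
}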
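One caveat of keying the TreeMap by reputation, as both the mapper and the reducer do: two records with the same reputation map to the same key, so the later put replaces the earlier record. With 5,000 users and only 1,000 possible reputation values, such collisions are expected. The sketch below shows the collision and one possible alternative, a size-bounded min-heap built on java.util.PriorityQueue. This is a swapped-in technique for illustration, not the approach used in this article; the User class and the topK method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeMap;

class TopKWithTiesSketch {
    // Hypothetical record type used only in this illustration.
    static final class User {
        final String name;
        final int reputation;

        User(String name, int reputation) {
            this.name = name;
            this.reputation = reputation;
        }
    }

    // Min-heap bounded to k elements: the head is the weakest of the current top k.
    // Users with equal reputation are kept as separate entries.
    static List<User> topK(List<User> users, int k) {
        Comparator<User> byReputation = Comparator.comparingInt((User u) -> u.reputation);
        PriorityQueue<User> heap = new PriorityQueue<>(byReputation);
        for (User u : users) {
            heap.offer(u);
            if (heap.size() > k) {
                heap.poll(); // drop the user with the lowest reputation
            }
        }
        List<User> result = new ArrayList<>(heap);
        result.sort(byReputation.reversed()); // highest reputation first
        return result;
    }

    public static void main(String[] args) {
        List<User> users = Arrays.asList(
                new User("a", 1500), new User("b", 1500), new User("c", 1200));

        // Keyed by reputation: "b" silently replaces "a"
        TreeMap<Integer, String> byKey = new TreeMap<>();
        for (User u : users) {
            byKey.put(u.reputation, u.name);
        }
        System.out.println("TreeMap entries: " + byKey.size()); // 2

        // Heap-based variant keeps both users with reputation 1500
        System.out.println("heap top 2 size: " + topK(users, 2).size()); // 2
    }
}
```

When several users tie at the cut-off, which of them survives in the heap is arbitrary; a secondary sort key would be needed for a deterministic result.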

The complete code is as follows:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
import java.util.Collections;
import java.util.TreeMap;

public class Top10 {
    // Marker that precedes the reputation value in each input line
    private static final String INDEX_STR = "Reputation = ";

    // Extract the reputation value that follows "Reputation = "
    private static int parseReputation(String line) {
        return Integer.parseInt(line.substring(line.indexOf(INDEX_STR) + INDEX_STR.length()).trim());
    }

    public static class TopMapper extends Mapper<Object, Text, NullWritable, Text> {
        // Ascending order, so firstKey() is the smallest reputation
        private final TreeMap<Integer, Text> topMap = new TreeMap<Integer, Text>();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            // Records with the same reputation share a key, so a later one replaces an earlier one
            topMap.put(parseReputation(line), new Text(line));
            if (topMap.size() > 10) {
                topMap.remove(topMap.firstKey());
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit this mapper's local top 10 once all of its input has been processed
            for (Text t : topMap.values()) {
                context.write(NullWritable.get(), t);
            }
        }
    }

    public static class TopReducer extends Reducer<NullWritable, Text, NullWritable, Text> {
        // Descending order, so lastKey() is the smallest reputation
        private final TreeMap<Integer, Text> topMap = new TreeMap<Integer, Text>(Collections.reverseOrder());

        @Override
        public void reduce(NullWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text value : values) {
                // Hadoop reuses the Text instance across iterations, so store a copy
                topMap.put(parseReputation(value.toString()), new Text(value));
                if (topMap.size() > 10) {
                    topMap.remove(topMap.lastKey());
                }
            }
            // Write the global top 10, from highest to lowest reputation
            for (Text t : topMap.values()) {
                context.write(NullWritable.get(), t);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Input file and output directory are hard-coded, as in the original example
        String[] otherArgs = new String[]{"input/file.txt", "output"};

        // Remove the output directory left by a previous run; the job fails if it already exists.
        // The original called FileUtil.deleteDir("output"), a helper that is not shown here;
        // Hadoop's FileSystem API is used in its place.
        Path outputPath = new Path(otherArgs[1]);
        FileSystem fs = outputPath.getFileSystem(configuration);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }

        Job job = Job.getInstance(configuration, "Top10");
        job.setJarByClass(Top10.class);
        job.setMapperClass(TopMapper.class);
        job.setReducerClass(TopReducer.class);
        // A single reducer is required so that one task sees every local top 10
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, outputPath);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

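Problems encountered

None. The code ran successfully on the first attempt, and it also resolved the problem of the console not displaying the job's progress.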
