Hadoop Design Patterns (2): The Top-N Problem

kdb_viewer

Category: hadoop

Article link: https://blog.csdn.net/kdb_viewer/article/details/83184790

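The simplest instance of the top-N problem is selecting the k smallest of N numbers. Quickselect does this in O(N) time on average, but it is an in-memory algorithm: the entire data set has to be loaded into main memory. For big data, borrow the idea behind heapsort instead: build a max-heap of size k from the first k numbers, then scan the remaining N - k numbers, and whenever a number is smaller than the top of the heap, remove the top and insert the new number. When the scan finishes, the heap holds the k smallest numbers, at a cost of O(N log k) time and O(k) memory.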
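The bounded max-heap idea can be sketched in plain Java as follows. This is a minimal illustration, not code from the original article: the class name `KSmallest` and the method `kSmallest` are hypothetical, and `java.util.PriorityQueue` with a reversed comparator stands in for the max-heap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class KSmallest {
    /* Returns the k smallest values of data in ascending order (hypothetical helper). */
    static List<Integer> kSmallest(Iterable<Integer> data, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Max-heap: the largest of the current k candidates sits at the top
        PriorityQueue<Integer> heap = new PriorityQueue<>(k, Collections.reverseOrder());
        for (int x : data) {
            if (heap.size() < k) {
                heap.add(x);              // still filling the first k slots
            } else if (x < heap.peek()) {
                heap.poll();              // drop the largest candidate
                heap.add(x);              // and keep the smaller newcomer
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(12, 13, 14, 15, 10, 100, 200, 300, 1, 67);
        System.out.println(kSmallest(data, 3)); // prints [1, 10, 12]
    }
}
```

Only k values are held in memory at any time, so the input can be streamed from disk or from the network.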

A common web-analytics scenario is finding the N most-visited websites over a given period of time.

Formal description of the top-N problem:

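Let N be an integer and let L be a List<Tuple2<T,Integer>> , where T can be any type, such as a string or a URL. Let L.size()=S , so that the elements of L are $\left \{ (K_{i},V_{i}),1\leq i\leq S \right \}$ , where $K_{i}$ has type T and $V_{i}$ has type Integer. Let Sort(L) return the elements of L sorted by value in descending order:

$\left \{ (A_{j},B_{j}),1\leq j\leq S,B_{1}\geq B_{2}\geq ...\geq B_{S} \right \}$

The top-N of L is then the first N elements of Sort(L), assuming $N\leq S$ : $\left \{ (A_{j},B_{j}),1\leq j\leq N \right \}$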
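As a baseline, the definition can be transcribed directly: sort the whole list by value in descending order, then keep the first N pairs. The sketch below assumes `java.util.Map.Entry` as a stand-in for `Tuple2`; the class name `TopNBySort` and the sample site names are made up for illustration. It only works when L fits in memory, which is exactly the restriction the rest of the article works around.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TopNBySort {
    /* Sort(L) by value in descending order, then keep the first min(N, S) pairs. */
    static <T> List<Map.Entry<T, Integer>> topN(List<Map.Entry<T, Integer>> l, int n) {
        List<Map.Entry<T, Integer>> sorted = new ArrayList<>(l);
        sorted.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        int end = Math.min(Math.max(n, 0), sorted.size());
        return new ArrayList<>(sorted.subList(0, end));
    }

    public static void main(String[] args) {
        // Hypothetical (site, visits) pairs
        List<Map.Entry<String, Integer>> visits = new ArrayList<>();
        visits.add(new SimpleEntry<String, Integer>("a.example", 120));
        visits.add(new SimpleEntry<String, Integer>("b.example", 300));
        visits.add(new SimpleEntry<String, Integer>("c.example", 45));
        System.out.println(topN(visits, 2)); // prints [b.example=300, a.example=120]
    }
}
```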

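To implement top-N we use Java's TreeMap, a sorted map. Like std::map in C++, it keeps its keys in sorted order; TreeMap is backed by a red-black tree, as typical std::map implementations are.

An example: a cats table has three attributes, cat_id, cat_name and cat_weight. In a relational database, the 10 heaviest cats can be retrieved with the following SQL query: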
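The TreeMap operations that the job relies on are `put`, `firstKey`, `lastKey` and `remove`. A minimal sketch of how they behave, with a hypothetical class name `TreeMapOrdering` and made-up sample entries:

```java
import java.util.TreeMap;

class TreeMapOrdering {
    /* Builds a small weight -> name map, then evicts its smallest key. */
    static TreeMap<Double, String> sample() {
        TreeMap<Double, String> cats = new TreeMap<>();
        cats.put(12.0, "cat1");
        cats.put(300.0, "cat300");
        cats.put(1.0, "cat001");
        // Keys are kept sorted regardless of insertion order
        System.out.println(cats.firstKey()); // prints 1.0
        System.out.println(cats.lastKey());  // prints 300.0
        // Removing the first key drops the smallest weight
        cats.remove(cats.firstKey());
        return cats;
    }

    public static void main(String[] args) {
        System.out.println(sample()); // prints {12.0=cat1, 300.0=cat300}
    }
}
```

Because a map holds one value per key, putting a second entry with an existing weight replaces the first one.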

SELECT cat_id, cat_name, cat_weight FROM cats
    ORDER BY cat_weight DESC LIMIT 10;

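With semi-structured data and big data, where a relational database cannot be used, the MapReduce solution is for each mapper to compute a local top-N list and send it to a single reducer, which computes the final top-N list from all the local lists it receives. Clearly only one reducer can be used, so the reducer is the performance bottleneck, although it receives at most N records from each mapper.

The complete code: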
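The two-stage idea can be simulated without Hadoop to see why it is correct: any element of the global top-N must also be in the local top-N of its own split. The sketch below is an illustration under simplifying assumptions (plain integers, distinct values, since a `TreeSet` drops duplicates); the class `LocalGlobalTopN` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class LocalGlobalTopN {
    /* "Mapper" side: the n largest values of one input split, in ascending order. */
    static List<Integer> localTopN(List<Integer> split, int n) {
        TreeSet<Integer> top = new TreeSet<>();
        for (int v : split) {
            top.add(v);
            if (top.size() > n) {
                top.pollFirst(); // evict the smallest candidate
            }
        }
        return new ArrayList<>(top);
    }

    /* "Reducer" side: merge every local list and trim to n once more. */
    static List<Integer> globalTopN(List<List<Integer>> splits, int n) {
        List<Integer> candidates = new ArrayList<>();
        for (List<Integer> split : splits) {
            candidates.addAll(localTopN(split, n)); // at most n values per split
        }
        return localTopN(candidates, n);
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = Arrays.asList(
            Arrays.asList(12, 13, 14, 15, 10),
            Arrays.asList(100, 200, 300, 1, 67),
            Arrays.asList(22, 23, 1000, 2000));
        System.out.println(globalTopN(splits, 3)); // prints [300, 1000, 2000]
    }
}
```

With M splits the final stage sees at most M × N values, however large the original input is.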

import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class top10cat {
    /* Mapper for the top-N cats problem */
    public static class catmapper extends
        Mapper<Object, Text, NullWritable, Text> {
        /* Java's sorted map, backed by a red-black tree and keyed by weight.
           Note: records with equal weights overwrite each other. */
        private TreeMap<Double, String> recordmap = new TreeMap<Double, String>();
        public void map(Object key, Text value, Context context) {
            /* N comes from the job configuration; fall back to 10 if it is unset */
            int N = context.getConfiguration().getInt("N", 10);
            String[] tokens = value.toString().split(",");
            Double weight = Double.parseDouble(tokens[0]);
            /* Insert the parsed record, then evict the smallest weight once the map holds more than N entries */
            recordmap.put(weight, value.toString());
            if (recordmap.size() > N)
                recordmap.remove(recordmap.firstKey());
        }

        /* cleanup emits the entries left in the map; note that every mapper emits up to N records */
        protected void cleanup(Context context)
            throws IOException, InterruptedException {
            for (String i : recordmap.values()) {
                /* Every record shares the NullWritable key, so the output of all mappers is handled by a single reducer */
                context.write(NullWritable.get(), new Text(i));
            }
        }
    }

    public static class catreducer extends
        Reducer<NullWritable, Text, NullWritable, Text> {
        private TreeMap<Double, String> recordmap = new TreeMap<Double, String>();
        public void reduce(NullWritable key, Iterable<Text> values,
        Context context) throws IOException, InterruptedException {
            /* N comes from the job configuration; fall back to 10 if it is unset */
            int N = context.getConfiguration().getInt("N", 10);
            for (Text value : values) {
                String[] tokens = value.toString().split(",");
                Double weight = Double.parseDouble(tokens[0]);
                recordmap.put(weight, value.toString());
                /* The reducer has to trim again, because it merges the local lists of several mappers */
                if (recordmap.size() > N)
                    recordmap.remove(recordmap.firstKey());
            }

            for (String i : recordmap.values()) {
                context.write(NullWritable.get(), new Text(i));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            throw new IllegalArgumentException(
            "Usage: hadoop jar <jar-name> "
            + "top10cat.top10cat "
            + "<the value of N> "
            + "<input-path> "
            + "<output-path>");
        }
        Configuration conf = new Configuration();
        conf.set("N", args[0]);
        Job job = Job.getInstance(conf, "top10cat");
        job.setJobName("top10cat");

        Path inputPath = new Path(args[1]);
        Path outputPath = new Path(args[2]);
        FileInputFormat.setInputPaths(job, inputPath);
        FileOutputFormat.setOutputPath(job, outputPath);

        job.setJarByClass(top10cat.class);
        job.setMapperClass(catmapper.class);
        job.setReducerClass(catreducer.class);
        job.setNumReduceTasks(1);

        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);

        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
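One limitation of keying the TreeMap by weight is that two records with the same weight collide, and the later one replaces the earlier one. A possible way around this, sketched here in plain Java rather than as a change to the job above, is a bounded min-heap of whole records ordered by weight. The class `TopNWithTies` and its helpers are hypothetical, and the input lines are assumed to have the same `weight,...` layout as the article's data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class TopNWithTies {
    /* Parses the leading weight field of a "weight,name,..." line. */
    static double weightOf(String line) {
        return Double.parseDouble(line.split(",")[0]);
    }

    /* Keeps the n heaviest records; records with equal weights are all retained. */
    static List<String> topN(List<String> lines, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        Comparator<String> byWeight = Comparator.comparingDouble(TopNWithTies::weightOf);
        // Min-heap: the lightest of the current candidates sits at the top
        PriorityQueue<String> heap = new PriorityQueue<>(n, byWeight);
        for (String line : lines) {
            heap.add(line);
            if (heap.size() > n) {
                heap.poll(); // evict the lightest record
            }
        }
        List<String> result = new ArrayList<>(heap);
        result.sort(byWeight);
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("12,cat1,cat1", "12,cat9,cat9", "5,cat5,cat5");
        // Both records with weight 12 survive; a TreeMap keyed by weight would keep only one
        System.out.println(topN(lines, 2));
    }
}
```

The comparator re-parses the weight on every comparison, which keeps the sketch short; a real job would more likely parse each record once into a small value class.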

The input text:

[root@master chapter3]# hdfs dfs -cat /chapter3/top10cat.txt
12,cat1,cat1
13,cat2,cat2
14,cat3,cat3
15,cat4,cat4
10,cat5,cat5
100,cat100,cat100
200,cat200,cat200
300,cat300,cat300
1,cat001,cat001
67,cat67,cat67
22,cat22,cat22
23,cat23,cat23
1000,cat1000,cat1000
2000,cat2000,cat2000

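The output (the 10 heaviest records, in ascending order of weight, because a TreeMap iterates its keys in ascending order):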

[root@master chapter3]# hdfs dfs -cat output/*
14,cat3,cat3
15,cat4,cat4
22,cat22,cat22
23,cat23,cat23
67,cat67,cat67
100,cat100,cat100
200,cat200,cat200
300,cat300,cat300
1000,cat1000,cat1000
2000,cat2000,cat2000

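To select the N smallest records instead, change firstKey to lastKey in the remove calls of both the mapper and the reducer, so that the largest key is evicted each time.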
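The effect of switching the evicted key can be checked outside Hadoop with a small sketch. The class `EvictionDirection`, the method `select` and the `keepLargest` flag are hypothetical names introduced for this illustration; the loop body mirrors the put-then-remove pattern of the job.

```java
import java.util.TreeMap;

class EvictionDirection {
    /* Keeps n records: the heaviest if keepLargest is true, otherwise the lightest. */
    static TreeMap<Double, String> select(String[] lines, int n, boolean keepLargest) {
        TreeMap<Double, String> recordmap = new TreeMap<>();
        for (String line : lines) {
            double weight = Double.parseDouble(line.split(",")[0]);
            recordmap.put(weight, line);
            if (recordmap.size() > n) {
                // firstKey evicts the smallest weight, lastKey evicts the largest
                recordmap.remove(keepLargest ? recordmap.firstKey() : recordmap.lastKey());
            }
        }
        return recordmap;
    }

    public static void main(String[] args) {
        String[] lines = {"12,cat1,cat1", "300,cat300,cat300", "1,cat001,cat001", "67,cat67,cat67"};
        System.out.println(select(lines, 2, true).keySet());  // prints [67.0, 300.0]
        System.out.println(select(lines, 2, false).keySet()); // prints [1.0, 12.0]
        // descendingMap() iterates from the largest key, matching ORDER BY ... DESC
        System.out.println(select(lines, 2, true).descendingMap().keySet()); // prints [300.0, 67.0]
    }
}
```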
