The MapReduce TopN Problem

Analysis: how to use MapReduce to solve a TopN problem, similar in spirit to WordCount.
Data source (tab-separated columns: record ID, key, value):

1   A   10  
2   A   40  
3   B   30  
4   C   20  
5   B   10  
6   D   40  
7   A   30  
8   C   20  
9   B   10  
10  D   40  
11  C   30  
12  D   20

Problem difficulties:

(1) Using a TreeSet on the Reduce side
(2) Iterating over the Iterable in reduce
Corollary: the values on the Reduce side can only be traversed once

The simpler approach is to use the built-in TreeMap or TreeSet. Both are backed by a red-black tree that keeps its keys in sorted order internally, but each insertion into the tree costs more than the corresponding heap adjustment. To find the largest Top N elements, you build a min-heap, whose defining property is that the root is the smallest element. The heap never needs to be fully sorted: whenever the root is replaced by a new element, a single heapify pass restores the min-heap property. A sketch of this approach follows.
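For comparison, here is a minimal sketch of the min-heap approach (not part of the original job; TopNHeap and topN are illustrative names) using java.util.PriorityQueue, which is a binary min-heap by default:

import java.util.PriorityQueue;

public class TopNHeap {
    // Keep the N largest values seen so far in a min-heap:
    // the root is always the smallest of the current Top N,
    // i.e. the element to evict when a larger value arrives.
    public static PriorityQueue<Long> topN(long[] values, int n) {
        PriorityQueue<Long> heap = new PriorityQueue<Long>(n);
        for (long v : values) {
            if (heap.size() < n) {
                heap.offer(v);
            } else if (v > heap.peek()) {
                heap.poll();    // drop the smallest of the current Top N
                heap.offer(v);  // sifting restores the heap property
            }
        }
        return heap;
    }

    public static void main(String[] args) {
        // prints the three largest values, in heap order rather than sorted order
        System.out.println(topN(new long[]{10, 40, 30, 20, 10, 40}, 3));
    }
}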

Without an explicit Comparator, a TreeMap sorts entries by key in ascending order by default.
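A quick illustration of that default ordering (illustrative snippet, not from the original post):

import java.util.TreeMap;

public class TreeMapOrder {
    public static void main(String[] args) {
        TreeMap<Long, String> map = new TreeMap<Long, String>();
        map.put(30L, "B");
        map.put(10L, "A");
        map.put(40L, "D");
        System.out.println(map); // {10=A, 30=B, 40=D} -- ascending by key
    }
}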

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.TreeSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TopN implements Tool{

    public static class mapper extends Mapper<LongWritable, Text, Text, LongWritable>{
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
            // Input line format: "<record id>\t<key>\t<value>"
            String[] strings = value.toString().split("\t");
            context.write(new Text(strings[1].trim()), new LongWritable(Long.parseLong(strings[2].trim())));
        }
    }

    public static class reduce extends Reducer<Text,LongWritable,Text,LongWritable>{
        public void reduce(Text key,Iterable<LongWritable> values,Context context) throws IOException, InterruptedException{
            // TreeSet keeps values in ascending order; note that because it is a Set,
            // duplicate values collapse into a single entry.
            TreeSet<Long> tSet = new TreeSet<Long>();
            for(LongWritable value : values){
                tSet.add(value.get());
                // Evict the smallest element as soon as the set holds more than 3,
                // so it always contains the (up to) 3 largest distinct values.
                if(tSet.size() > 3){
                    tSet.remove(tSet.first());
                }
            }
            for(Long num : tSet){
                context.write(key, new LongWritable(num));
            }
        }
    }

    public int run(String[] str) throws IOException, URISyntaxException, ClassNotFoundException, InterruptedException {
        String input = str[0];
        String output = str[1];
        Configuration conf = getConf();
        FileSystem file = FileSystem.get(new URI(input), conf);
        Path outPath = new Path(output);
        if (file.exists(outPath)) {
            file.delete(outPath, true);
        }
        Job job = Job.getInstance(conf);
        job.setJarByClass(TopN.class);
        FileInputFormat.setInputPaths(job, input);
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapperClass(mapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(4);
        job.setReducerClass(reduce.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileOutputFormat.setOutputPath(job, outPath);
        job.setOutputFormatClass(TextOutputFormat.class);
        // submit the job and block until it finishes, printing progress
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new TopN(), args));
    }

    private Configuration conf;

    public Configuration getConf() {
        return conf;
    }
    public void setConf(Configuration conf) {
        this.conf = conf;
    }
}
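
Once packaged into a jar (the jar name and HDFS paths below are assumptions for illustration), the job can be launched with, e.g.:

hadoop jar topn.jar TopN /input/topn /output/topn

Because run() deletes the output path when it already exists, the job can be re-run without manually cleaning up HDFS first.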
Corollary: the values on the Reduce side can be iterated only once (a single forward pass)

Although the reduce method is invoked many times, only two objects back the key and the value: the framework reuses those same two objects on every iteration, deserializing the next record into them in place (unlike String, Writable objects are mutable, which is exactly why holding on to references to them goes wrong). So if you need to keep a key or a value around, you must either copy the primitive value out of it or clone the object, as the examples below show.
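First, a minimal sketch of the pitfall (a hypothetical reducer body using the new API): caching the value references directly leaves the list full of pointers to the same reused object, so every entry ends up reporting the last record's value.

public void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
    List<LongWritable> cache = new LinkedList<LongWritable>();
    for (LongWritable value : values) {
        cache.add(value); // WRONG: every element is the same reused object
    }
    // At this point every entry in cache reports the value of the LAST record.
}

The correct pattern, caching a copy of each value, is shown below using the old org.apache.hadoop.mapred API: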

// doSomethingWithValue and doSomethingElseThatCantBeDoneInFirstLoop are placeholders.
public void reduce(Text host, Iterator<CrawlDatum> values, OutputCollector<Text, CrawlDatum> output, Reporter reporter) throws IOException {

    List<CrawlDatum> cache = new LinkedList<CrawlDatum>();
    // first loop: process each value and cache a *copy* of it,
    // because the framework reuses the same CrawlDatum instance
    while (values.hasNext()) {
        CrawlDatum datum = values.next();
        doSomethingWithValue(datum);
        CrawlDatum copy = new CrawlDatum();
        copy.set(datum);
        cache.add(copy);
    }
    // second loop: the cached copies can now be traversed safely
    for (CrawlDatum value : cache) {
        doSomethingElseThatCantBeDoneInFirstLoop(value);
    }
}
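
With the new (org.apache.hadoop.mapreduce) API used by the TopN job above, the copying can also be done generically with WritableUtils.clone, which deep-copies any Writable through serialization. A minimal sketch (CachingReducer is an illustrative name):

import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapreduce.Reducer;

public class CachingReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        List<LongWritable> cache = new LinkedList<LongWritable>();
        for (LongWritable value : values) {
            // clone() deep-copies the Writable via serialization, so the
            // cached object is independent of the instance Hadoop reuses
            cache.add(WritableUtils.clone(value, context.getConfiguration()));
        }
        // the cache can now be iterated as many times as needed
        for (LongWritable v : cache) {
            context.write(key, v);
        }
    }
}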

Reference blogs:
http://blog.csdn.net/zeb_perfect/article/details/53335207
On the Reduce iterable single-pass iteration problem:
http://www.wangzhe.tech/MapReduce/MapReduce%E4%B8%ADreduce%E9%98%B6%E6%AE%B5iterator%E5%A6%82%E4%BD%95%E9%81%8D%E5%8E%86%E4%B8%A4%E9%81%8D%E5%92%8C%E6%89%80%E9%81%87%E5%88%B0%E7%9A%84%E9%97%AE%E9%A2%98/2016/07/13/
