The MapReduce TopN Problem

Analysis: how to use MapReduce to solve a TopN problem, similar in spirit to WordCount.
Data source (tab-separated columns: record ID, key, value):

1   A   10  
2   A   40  
3   B   30  
4   C   20  
5   B   10  
6   D   40  
7   A   30  
8   C   20  
9   B   10  
10  D   40  
11  C   30  
12  D   20

Problem difficulties:

(1) Using a TreeSet on the Reduce side
(2) Iterating over the Iterable in reduce
Corollary: the values on the Reduce side can only be traversed once

The simpler approach is to use the built-in TreeMap or TreeSet. Both are backed by a red-black tree that keeps its keys in sorted order internally, but each insertion into the tree costs more than the corresponding heap adjustment. To find the largest Top N elements, you build a min-heap, whose defining property is that the root is the smallest element. The heap never needs to be fully sorted: whenever the root is replaced by a new element, a single heapify pass restores the min-heap property. A sketch of this approach follows.
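For comparison, here is a minimal sketch of the min-heap approach (not part of the original job; TopNHeap and topN are illustrative names) using java.util.PriorityQueue, which is a binary min-heap by default:

import java.util.PriorityQueue;

public class TopNHeap {
    // Keep the N largest values seen so far in a min-heap:
    // the root is always the smallest of the current Top N,
    // i.e. the element to evict when a larger value arrives.
    public static PriorityQueue<Long> topN(long[] values, int n) {
        PriorityQueue<Long> heap = new PriorityQueue<Long>(n);
        for (long v : values) {
            if (heap.size() < n) {
                heap.offer(v);
            } else if (v > heap.peek()) {
                heap.poll();    // drop the smallest of the current Top N
                heap.offer(v);  // sifting restores the heap property
            }
        }
        return heap;
    }

    public static void main(String[] args) {
        // prints the three largest values, in heap order rather than sorted order
        System.out.println(topN(new long[]{10, 40, 30, 20, 10, 40}, 3));
    }
}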

Without an explicit Comparator, a TreeMap sorts entries by key in ascending order by default.
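A quick illustration of that default ordering (illustrative snippet, not from the original post):

import java.util.TreeMap;

public class TreeMapOrder {
    public static void main(String[] args) {
        TreeMap<Long, String> map = new TreeMap<Long, String>();
        map.put(30L, "B");
        map.put(10L, "A");
        map.put(40L, "D");
        System.out.println(map); // {10=A, 30=B, 40=D} -- ascending by key
    }
}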

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.TreeSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TopN implements Tool{

    public static class mapper extends Mapper<LongWritable, Text, Text, LongWritable>{
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
            // Input line format: "<record id>\t<key>\t<value>"
            String[] strings = value.toString().split("\t");
            context.write(new Text(strings[1].trim()), new LongWritable(Long.parseLong(strings[2].trim())));
        }
    }

    public static class reduce extends Reducer<Text,LongWritable,Text,LongWritable>{
        public void reduce(Text key,Iterable<LongWritable> values,Context context) throws IOException, InterruptedException{
            // TreeSet keeps values in ascending order; note that because it is a Set,
            // duplicate values collapse into a single entry.
            TreeSet<Long> tSet = new TreeSet<Long>();
            for(LongWritable value : values){
                tSet.add(value.get());
                // Evict the smallest element as soon as the set holds more than 3,
                // so it always contains the (up to) 3 largest distinct values.
                if(tSet.size() > 3){
                    tSet.remove(tSet.first());
                }
            }
            for(Long num : tSet){
                context.write(key, new LongWritable(num));
            }
        }
    }

    public int run(String[] str) throws IOException, URISyntaxException, ClassNotFoundException, InterruptedException {
        String input = str[0];
        String output = str[1];
        Configuration conf = getConf();
        FileSystem file = FileSystem.get(new URI(input), conf);
        Path outPath = new Path(output);
        if (file.exists(outPath)) {
            file.delete(outPath, true);
        }
        Job job = Job.getInstance(conf);
        job.setJarByClass(TopN.class);
        FileInputFormat.setInputPaths(job, input);
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapperClass(mapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(4);
        job.setReducerClass(reduce.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileOutputFormat.setOutputPath(job, outPath);
        job.setOutputFormatClass(TextOutputFormat.class);
        // submit the job and block until it finishes, printing progress
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new TopN(), args));
    }

    private Configuration conf;

    public Configuration getConf() {
        return conf;
    }
    public void setConf(Configuration conf) {
        this.conf = conf;
    }
}
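
Once packaged into a jar (the jar name and HDFS paths below are assumptions for illustration), the job can be launched with, e.g.:

hadoop jar topn.jar TopN /input/topn /output/topn

Because run() deletes the output path when it already exists, the job can be re-run without manually cleaning up HDFS first.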
Corollary: the values on the Reduce side can be iterated only once (a single forward pass)

Although the reduce method is invoked many times, only two objects back the key and the value: the framework reuses those same two objects on every iteration, deserializing the next record into them in place (unlike String, Writable objects are mutable, which is exactly why holding on to references to them goes wrong). So if you need to keep a key or a value around, you must either copy the primitive value out of it or clone the object, as the examples below show.
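First, a minimal sketch of the pitfall (a hypothetical reducer body using the new API): caching the value references directly leaves the list full of pointers to the same reused object, so every entry ends up reporting the last record's value.

public void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
    List<LongWritable> cache = new LinkedList<LongWritable>();
    for (LongWritable value : values) {
        cache.add(value); // WRONG: every element is the same reused object
    }
    // At this point every entry in cache reports the value of the LAST record.
}

The correct pattern, caching a copy of each value, is shown below using the old org.apache.hadoop.mapred API: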

// doSomethingWithValue and doSomethingElseThatCantBeDoneInFirstLoop are placeholders.
public void reduce(Text host, Iterator<CrawlDatum> values, OutputCollector<Text, CrawlDatum> output, Reporter reporter) throws IOException {

    List<CrawlDatum> cache = new LinkedList<CrawlDatum>();
    // first loop: process each value and cache a *copy* of it,
    // because the framework reuses the same CrawlDatum instance
    while (values.hasNext()) {
        CrawlDatum datum = values.next();
        doSomethingWithValue(datum);
        CrawlDatum copy = new CrawlDatum();
        copy.set(datum);
        cache.add(copy);
    }
    // second loop: the cached copies can now be traversed safely
    for (CrawlDatum value : cache) {
        doSomethingElseThatCantBeDoneInFirstLoop(value);
    }
}
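
With the new (org.apache.hadoop.mapreduce) API used by the TopN job above, the copying can also be done generically with WritableUtils.clone, which deep-copies any Writable through serialization. A minimal sketch (CachingReducer is an illustrative name):

import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapreduce.Reducer;

public class CachingReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        List<LongWritable> cache = new LinkedList<LongWritable>();
        for (LongWritable value : values) {
            // clone() deep-copies the Writable via serialization, so the
            // cached object is independent of the instance Hadoop reuses
            cache.add(WritableUtils.clone(value, context.getConfiguration()));
        }
        // the cache can now be iterated as many times as needed
        for (LongWritable v : cache) {
            context.write(key, v);
        }
    }
}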

Reference blogs:
http://blog.csdn.net/zeb_perfect/article/details/53335207
On the Reduce iterable single-pass iteration problem:
http://www.wangzhe.tech/MapReduce/MapReduce%E4%B8%ADreduce%E9%98%B6%E6%AE%B5iterator%E5%A6%82%E4%BD%95%E9%81%8D%E5%8E%86%E4%B8%A4%E9%81%8D%E5%92%8C%E6%89%80%E9%81%87%E5%88%B0%E7%9A%84%E9%97%AE%E9%A2%98/2016/07/13/
