025_MapReduce Example: Hadoop TopKey Algorithm

sunseazhu

Category: Hadoop. Tags: MapReduce, TopKey, Treeset

Article link: https://blog.csdn.net/u011528448/article/details/50962244

1. Requirements

2. The maximum value of one column in a file

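Approach: compare the values in the column one by one, keep the largest value seen so far, and output it at the end. The comparison step is the same one used in sorting algorithms such as quicksort and bubble sort, but here a single pass over the data is enough.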
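The single-pass comparison can be sketched in plain Java, outside Hadoop. This is only an illustration under assumptions: the class name ColumnMaxSketch, the method maxEntry, and the sample lines are hypothetical, and the input is assumed to be tab-separated "word, count" lines like those written by wordcount.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the original post.
class ColumnMaxSketch {

    // Returns the (word, count) pair with the largest count,
    // or null when no line has at least two tab-separated fields.
    static Map.Entry<String, Long> maxEntry(List<String> lines) {
        String bestWord = null;
        long bestCount = Long.MIN_VALUE;
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length < 2) {
                continue; // skip malformed lines
            }
            long count = Long.parseLong(fields[1].trim());
            // Strict '>' keeps the first word seen when counts tie.
            if (bestWord == null || count > bestCount) {
                bestCount = count;
                bestWord = fields[0];
            }
        }
        if (bestWord == null) {
            return null;
        }
        return new AbstractMap.SimpleEntry<String, Long>(bestWord, bestCount);
    }

    public static void main(String[] args) {
        // Made-up sample data in wordcount output format.
        List<String> lines = Arrays.asList("hadoop\t3", "spark\t7", "hive\t5");
        Map.Entry<String, Long> max = maxEntry(lines);
        System.out.println(max.getKey() + "\t" + max.getValue());
    }
}
```

The loop needs only one variable pair of state, which is why the Hadoop version can hold it in mapper fields and write it once at the end.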

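Mapper: this job builds on the output of the wordcount program and looks at a single column, so the mapper only has to track the running maximum together with its word and then emit that pair once, through context.write() in cleanup(). No Reducer is needed. Note that each map task writes its own maximum, so the output holds a single global maximum only when the whole input is processed by one map task.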

package org.dragon.hadoop.mapreduce.app.topk;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Purpose: find the maximum value of one column in a file.
 *
 * Given the word counts written by the wordcount program, find the word
 * with the highest frequency, i.e. the maximum value among the given
 * key-value pairs.
 *
 * @author ZhuXY
 * @time 2016-3-12 3:43:23 PM
 */
public class TopKMapReduce {

    /*
     * ******************************************************
     * This program illustrates that one input split corresponds to one
     * map task, while each line of input triggers one map() call.
     * In other words, a single map task calls map() many times.
     * ******************************************************
     */

    // Mapper class
    static class TopKMapper extends
            Mapper<LongWritable, Text, Text, LongWritable> {
        // map output key (Java objects must be instantiated before use)
        private Text mapOutputKey = new Text();

        // map output value
        private LongWritable mapOutputValue = new LongWritable();

        /*
         * ********************************
         * Fields declared here are shared by every map() call
         * within the same map task.
         * ********************************
         */

        // running maximum, initialised to Long.MIN_VALUE
        private long topKValue = Long.MIN_VALUE;

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // split the line into word and count
            String lineValue = value.toString();
            String[] str = lineValue.split("\t");

            // skip lines that do not have both fields
            if (str.length < 2) {
                return;
            }

            long tempValue = Long.parseLong(str[1].trim());

            // compare with the running maximum
            if (topKValue < tempValue) {
                topKValue = tempValue;
                // when a new maximum is found, also keep its word as the output key
                mapOutputKey.set(str[0]);
            }

            // Nothing is written to the context here; the result is
            // emitted once in cleanup().
        }

        @Override
        protected void setup(Context context) throws IOException,
                InterruptedException {
            super.setup(context);
        }

        @Override
        protected void cleanup(Context context) throws IOException,
                InterruptedException {
            // set map output value
            mapOutputValue.set(topKValue);

            // write the maximum found by this map task
            context.write(mapOutputKey, mapOutputValue);
        }
    }

    // Driver code
    public int run(String[] args) throws Exception {
        // get conf
        Configuration conf = new Configuration();

        // create job (this constructor is deprecated in Hadoop 2.x,
        // where Job.getInstance(conf, name) replaces it)
        Job job = new Job(conf, TopKMapReduce.class.getSimpleName());

        // set job
        job.setJarByClass(TopKMapReduce.class);
        // 1) input
        Path inputDirPath = new Path(args[0]);
        FileInputFormat.addInputPath(job, inputDirPath);

        // 2) map
        job.setMapperClass(TopKMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // 3) reduce
        // job.setReducerClass(DataTotalReducer.class);
        // job.setOutputKeyClass(Text.class);
        // job.setOutputValueClass(DataWritable.class);
        // This job has no reduce phase, so the number of reduce tasks is
        // set to 0 and the map output is written directly.
        job.setNumReduceTasks(0);

        // 4) output
        Path outputDir = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outputDir);

        // submit job
        boolean isSuccess = job.waitForCompletion(true);

        // return status
        return isSuccess ? 0 : 1;
    }

    // run mapreduce
    public static void main(String[] args) throws Exception {
        // hard-coded paths for testing; these override any command-line arguments
        args = new String[] { "hdfs://hadoop-master:9000/wc/wcoutput",
                "hdfs://hadoop-master:9000/wc/output" };

        // run job
        int status = new TopKMapReduce().run(args);
        // exit
        System.exit(status);
    }

}

3. The Top Key values of one column in a file (largest or smallest)

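Approach: keep the entries in a TreeMap. A TreeMap keeps its entries sorted by key automatically, so the occurrence count is used as the key. Whenever TreeMap.size() > NUM, remove the entry with the smallest key.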
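Because the listing in this section is incomplete, here is a minimal plain-Java sketch of the TreeMap idea only, not the author's mapper. The class name TopKTreeMapSketch, the method topK, the parameter k (standing in for NUM), and the sample lines are all hypothetical. One caveat of keying by count: two words with the same count share a key, so the later one overwrites the earlier one.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the original post.
class TopKTreeMapSketch {

    // Keeps at most k entries, keyed by count, holding the largest counts seen.
    static TreeMap<Long, String> topK(List<String> lines, int k) {
        TreeMap<Long, String> top = new TreeMap<Long, String>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length < 2) {
                continue; // skip malformed lines
            }
            // Equal counts collide: the newer word replaces the older one.
            top.put(Long.parseLong(fields[1].trim()), fields[0]);
            if (top.size() > k) {
                // Keys are sorted ascending, so firstKey() is the smallest count.
                top.remove(top.firstKey());
            }
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up sample data in wordcount output format.
        List<String> lines = Arrays.asList(
                "hadoop\t3", "spark\t7", "hive\t5", "pig\t1", "hbase\t9");
        // Print the three largest counts, highest first.
        for (Map.Entry<Long, String> e : topK(lines, 3).descendingMap().entrySet()) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

To keep the smallest values instead, remove top.lastKey() when the map grows past k. In a mapper, the map would be a field filled in map() and written out in cleanup(), in the same way the maximum is handled in section 2.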

Mapper: extend the previous mapper with a TreeMap.

package org.dragon.hadoop.mapreduce.app.topk;

import java.io.IOException;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeMap;

import org.apache.hadoo
