MapReduce: Finding the Yearly Maximum Temperature, and Some Notes on the Combiner

Latest recommended article published 2022-05-30 15:10:28

逸卿


Category: hadoop. Tags: mapreduce, combiner, distributed file system, map, hdfs


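I am new to MapReduce programming. The example below follows the introductory program in Chapter 2 of Hadoop: The Definitive Guide, which finds the highest temperature for each year in a weather data set, so if you are an experienced hand you can stop reading here. The principle is the same as counting word frequencies. What I mainly want to discuss is my understanding of the combiner, but I might as well write up the implementation too.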

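The data a MapReduce application processes is stored in HDFS (Hadoop Distributed File System). Before loading data into HDFS we usually preprocess the raw data, for two reasons: 1> MapReduce handles a small number of large files more easily and more efficiently than a large number of small ones. If your input consists of many small files, merge them into relatively few large files first and the job will run more efficiently. 2> Raw data sets often contain bad records, which are of no use to us and can sometimes distort the results. The weather data set is made up of many small files, so it is preprocessed before the job runs. Any scripting tool can do this, so I will not go into detail; the point is only that preprocessing is necessary.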

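A weather record is a string of characters, mostly digits, in which each position has a fixed meaning; both the year and the temperature appear in every record. Sample data can be downloaded from dedicated sites, but for convenience I made up a few records myself and saved them in a file named sample.txt. A record looks like this: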

00670119909999919500515070049999999N9+00011+29999999999

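In this record the fields of interest are the year (1950, at offsets 15 to 18 counting from 0) and the temperature (+0001, the sign at offset 37 followed by four digits). With those in mind, the following paragraphs trace how MapReduce executes the job.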
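To make the field positions concrete, here is a minimal sketch that extracts both fields from the sample record using plain Java, without any Hadoop classes. The class name `RecordParser` and its constants are hypothetical; the offsets are taken from the made-up sample line in this post and are an assumption about that format only, not the layout of the full NCDC record.

```java
/** Sketch: pull the year and the temperature out of one fixed-width record. */
final class RecordParser {

    // Offsets assumed from the sample line in this post (hypothetical constants).
    static final int YEAR_START = 15;
    static final int YEAR_END = 19;   // exclusive
    static final int SIGN_INDEX = 37;
    static final int TEMP_END = 42;   // exclusive

    /** Returns the four-character year field. */
    static String year(String line) {
        return line.substring(YEAR_START, YEAR_END);
    }

    /** Returns the signed temperature as an int. */
    static int temperature(String line) {
        // Skip a leading '+' before parsing; keep a leading '-' so negatives parse as-is.
        int start = line.charAt(SIGN_INDEX) == '+' ? SIGN_INDEX + 1 : SIGN_INDEX;
        return Integer.parseInt(line.substring(start, TEMP_END));
    }

    public static void main(String[] args) {
        String line = "00670119909999919500515070049999999N9+00011+29999999999";
        System.out.println(year(line) + "\t" + temperature(line));
    }
}
```

For the sample record this yields the year 1950 and the temperature 1. A record with `-` at offset 37 goes through the other branch, so the sign stays part of the parsed number.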

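As the figure shows, the overall MapReduce flow is: HDFS data set => Map(key, value) => Reduce(key, Iterable<...> values). Note the difference between the input to Map and the input to Reduce (that is, the output of Map). Taking sample.txt as input, Map is called once per line. By default the key is the offset of that line within the file, and we usually do nothing with it; the value is the line itself. Map's main task is to process that line, which here means extracting the year and the corresponding temperature and emitting pairs such as 1950 0, 1950 22, 1950 -11. Why, then, does the figure show the key and value as 1950, [0, 22, -11]? Once Map has written its output, the framework automatically runs the shuffle phase, which sorts the pairs and groups them by key, so 1950 0, 1950 22, 1950 -11 become 1950 [0, 22, -11]. The value that Reduce receives is an Iterable over [0, 22, -11]; applying max to it gives the highest temperature for that year.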
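The grouping step described above can be imitated in memory. The following is only a sketch of the idea, assuming the map output is already available as (year, temperature) string pairs; the class `ShuffleSketch` and its methods are hypothetical and do not use Hadoop's real shuffle, which also partitions and spills data to disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of shuffle and reduce; no Hadoop classes are involved. */
final class ShuffleSketch {

    /** Shuffle: group every (year, temperature) pair by year, with keys sorted. */
    static Map<String, List<Integer>> shuffle(String[][] mapOutput) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] pair : mapOutput) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(pair[1]));
        }
        return grouped;
    }

    /** Reduce: collapse each year's list of temperatures into its maximum. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int max = Integer.MIN_VALUE;
            for (int temperature : entry.getValue()) {
                max = Math.max(max, temperature);
            }
            result.put(entry.getKey(), max);
        }
        return result;
    }

    public static void main(String[] args) {
        // The three pairs used in the text above.
        String[][] mapOutput = { {"1950", "0"}, {"1950", "22"}, {"1950", "-11"} };
        Map<String, List<Integer>> grouped = shuffle(mapOutput);
        System.out.println(grouped);          // what reduce receives
        System.out.println(reduce(grouped));  // one maximum per year
    }
}
```

With the three pairs from the text, the grouped form is 1950 with the list [0, 22, -11], and the reduced result is 22 for 1950.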

Implementation:

 import java.io.IOException;  
   
 import org.apache.hadoop.conf.Configuration;  
 import org.apache.hadoop.fs.Path;  
 import org.apache.hadoop.io.IntWritable;  
 import org.apache.hadoop.io.LongWritable;  
 import org.apache.hadoop.io.Text;  
 import org.apache.hadoop.mapreduce.Job;  
 import org.apache.hadoop.mapreduce.Mapper;  
 import org.apache.hadoop.mapreduce.Reducer;  
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  
   
 public class MaxTemperature {  
       
     // Mapper: input is (byte offset, line); output is (year, temperature).
     static class MaxTemperatureMapper extends Mapper<LongWritable , Text , Text , IntWritable>{

         @Override
         public void map(LongWritable key , Text value , Context context) throws IOException , InterruptedException{

             String line = value.toString();
             String year = line.substring(15 , 19);   // the year field

             int airTemperature ;   // the temperature value

             // Skip a leading '+' before parsing; a leading '-' is parsed as part of the number.
             if(line.charAt(37) == '+'){
                 airTemperature = Integer.parseInt(line.substring(38, 42));
             }else{
                 airTemperature = Integer.parseInt(line.substring(37, 42));
             }

             context.write(new Text(year) , new IntWritable(airTemperature));

         }

     }
       
     // Reducer: receives (year, all temperatures for that year) and writes (year, maximum).
     static class MapTemperatureReducer extends Reducer<Text , IntWritable, Text , IntWritable>{

         @Override
         public void reduce(Text key , Iterable<IntWritable> values , Context context) throws IOException , InterruptedException{

             int maxValue = Integer.MIN_VALUE ;

             for(IntWritable value : values){
                 maxValue = Math.max(maxValue, value.get());
             }

             context.write(key, new IntWritable(maxValue));

         }

     }
       
     public static void main(String[] args) throws Exception{  
           
       Configuration conf = new Configuration();  
       conf.set("mapred.job.tracker","192.168.1.252:9001");  
   
       Job job = new Job(conf, "Max temperature");  
       
       job.setJarByClass(MaxTemperature.class);  
   
       job.setMapperClass(MaxTemperatureMapper.class);  
   
       job.setReducerClass(MapTemperatureReducer.class);  
         
       job.setOutputKeyClass(Text.class);  
   
       job.setOutputValueClass(IntWritable.class);  
   
       FileInputFormat.addInputPath(job, new Path("hdfs://192.168.1.252:9000/user/root/input/sample.txt"));  
   
       FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.1.252:9000/user/root/output2"));  
         
       System.exit(job.waitForCompletion(true) ? 0 : 1);  
   
     }  
   
 }

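Now for the combiner. Hadoop lets the user declare a combiner function that runs on the map output; the combiner's output then becomes the input to the reduce function. Finding a maximum serves as the illustration: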

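Think of it this way: the combiner does the same work as the reducer. What you have to decide is whether applying the operation to the whole set gives the same result as applying it to the new, smaller set that the combiner produces, given that the combiner and the reducer perform the same operation. In other words: does processing the data locally first change the global result? For example: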

Maximum: max(0, 20, 10, 25, 15, 3, 9, 35) = max(max(0, 20, 10), max(25, 15), max(3, 9, 35)) = max(20, 25, 35) = 35. This expression describes the combiner at work. max(0, 20, 10, 25, 15, 3, 9, 35) and max(max(0, 20, 10), max(25, 15), max(3, 9, 35)) are always equal, so this computation can be optimized with a combiner.

Now consider the mean: mean(0, 20, 10, 25, 15, 3, 9, 35) = 14.625. With a combiner, the computation becomes:

mean(mean(0, 20, 10), mean(25, 15), mean(3, 9, 35)) = mean(10, 20, 15.67) ≈ 15.22, which is clearly not equal to 14.625. In this case the combiner approach cannot be used.
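The two calculations above can be checked with a few lines of plain Java. This is a sketch under stated assumptions: the class `CombinerCheck` and its helpers are hypothetical, and the three groups stand in for the outputs of three map tasks. The last helper shows one common workaround that is not part of the original example: carry a (sum, count) pair through the combiner and divide only at the end.

```java
/** Sketch: max survives a combiner, a naive mean does not, a (sum, count) pair does. */
final class CombinerCheck {

    static int max(int... xs) {
        int m = Integer.MIN_VALUE;
        for (int x : xs) m = Math.max(m, x);
        return m;
    }

    static double mean(double... xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    /** Partial aggregate {sum, count} for the values of one map task. */
    static long[] sumCount(int... xs) {
        long sum = 0;
        for (int x : xs) sum += x;
        return new long[] { sum, xs.length };
    }

    /** Merging partial aggregates is plain addition, so the grouping does not matter. */
    static long[] merge(long[]... parts) {
        long sum = 0;
        long count = 0;
        for (long[] p : parts) {
            sum += p[0];
            count += p[1];
        }
        return new long[] { sum, count };
    }

    public static void main(String[] args) {
        int maxOfMax = max(max(0, 20, 10), max(25, 15), max(3, 9, 35));
        double meanOfMeans = mean(mean(0, 20, 10), mean(25, 15), mean(3, 9, 35));
        long[] total = merge(sumCount(0, 20, 10), sumCount(25, 15), sumCount(3, 9, 35));

        System.out.println("max of partial maxima  = " + maxOfMax);
        System.out.println("mean of partial means  = " + meanOfMeans);
        System.out.println("mean from (sum, count) = " + (double) total[0] / total[1]);
    }
}
```

The maximum of the partial maxima is 35, the same as the global maximum. The mean of the partial means comes out at roughly 15.22, while the merged pair is (117, 8), which gives back the true mean of 14.625.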

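A combiner is defined with the Reducer class. To use one, add the following setting to the job configuration:

job.setCombinerClass(MapTemperatureReducer.class);

The combiner's processing is usually the same as the Reducer's, which is why the reducer class from the listing above is reused here. Note that in the listing the call goes on the Job object, not on the Configuration object.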

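That sums up what I learned from Chapter 2. These are purely my own views, and my understanding may contain mistakes. Onward with each day's study!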
