Hadoop: Two Simple MapReduce Programs


qbyjxg001


Reposted from: http://www.linuxidc.com/Linux/2013-08/88631.htm

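This week I have been learning Hadoop programming. I had read "Hadoop: The Definitive Guide" before, but after finishing the HDFS chapter I could no longer follow the rest of the book. To be honest, I had always kept my distance from MapReduce programs and had no idea how this kind of program actually runs. This week I spent some time on the book "Hadoop in Action", and now I can read simple MapReduce programs and write a few simple examples myself.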

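Related reading:

Hadoop: The Definitive Guide (Chinese edition, with table of contents index) PDF http://www.linuxidc.com/Linux/2013-05/84948.htm
Hadoop: The Definitive Guide (Chinese 2nd edition) PDF http://www.linuxidc.com/Linux/2012-07/65972.htm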

Below are two simple MapReduce programs. They touch on a few basic Hadoop concepts, which are summarized in this article.

Source code download:

**************************************************************

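The source code is hosted on FTP server No. 1 of Linux公社 (LinuxIDC). Download details:

FTP address: ftp://www.linuxidc.com/

Username: www.6688.cc

Password: www.linuxidc.com

Directory: 2013年LinuxIDC.com\8月\Hadoop--两个简单的MapReduce程序

For download instructions, see http://www.linuxidc.net/thread-1187-1-1.html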

**************************************************************

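Example 1: Finding the Maximum Number

The problem is this: given a series of numbers, find the largest one. The requirement is very simple. Without MapReduce, an ordinary Java program solves it in a few lines of code. The example is used here to show how the Combiner works in Hadoop.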

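As we know, Hadoop uses the Mapper function to turn the data into <key, value> pairs, then sorts and redistributes these pairs across the network nodes (the shuffle), and finally uses the Reducer function to process them and write the result. Now suppose we have 100 million numbers (Hadoop was built for big data). The Mapper function would produce 100 million key/value pairs to be transferred over the network. If all we want is the maximum of those 100 million numbers, then clearly each Mapper only needs to output the largest value it has seen. This relieves the pressure on network bandwidth and also lightens the load on the Reducer, which makes the program more efficient.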

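If the Reducer only performs a simple operation such as maximum, minimum, or count, a Combiner can be used. If the job computes the average of a set of numbers, however, never use the Reducer as a Combiner: the average of partial averages is in general not the average of all the values, because the partial groups can have different sizes. The Combiner can be seen as the Reducer's helper, or as a Reducer that runs on the Mapper side. It shrinks the Mapper output, which reduces network traffic and the load on the Reducer. A Combiner example program follows.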
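The warning about averages can be checked with a few lines of plain Java that use no Hadoop classes. This is only an illustrative sketch: the class name AverageCombinerPitfall, the method names, and the way the sample numbers are divided into two map-side splits are assumptions made for the demonstration, not part of the original programs.

```java
class AverageCombinerPitfall {

    // Naive approach: each "combiner" emits the local average of its split,
    // and the "reducer" averages those partial averages.
    static double averageOfAverages(int[][] splits) {
        double total = 0;
        for (int[] split : splits) {
            double sum = 0;
            for (int v : split) {
                sum += v;
            }
            total += sum / split.length;
        }
        return total / splits.length;
    }

    // Safe approach: each "combiner" emits a (sum, count) pair, and the
    // "reducer" adds the pairs up and divides only once at the very end.
    static double averageFromSumAndCount(int[][] splits) {
        long totalSum = 0;
        long totalCount = 0;
        for (int[] split : splits) {
            long partialSum = 0; // the sum this split's combiner would emit
            for (int v : split) {
                partialSum += v;
            }
            totalSum += partialSum;
            totalCount += split.length;
        }
        return (double) totalSum / totalCount;
    }

    public static void main(String[] args) {
        // Two hypothetical input splits of different sizes.
        int[][] splits = { {12, 5, 9, 21, 43, 99}, {65, 32, 10} };
        System.out.println("average of averages: " + averageOfAverages(splits));
        System.out.println("sum/count average:   " + averageFromSumAndCount(splits));
    }
}
```

The two results differ because the splits have different sizes. Maximum, minimum, and count do not have this problem, since applying them to partial results gives the same answer as applying them to all the values at once.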

The input of the program looks like this:

12
5
9
21
43
99
65
32
10

The MapReduce program has to find the maximum of these numbers, 99. The Mapper function looks like this:

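import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<Object, Text, Text, IntWritable> {

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Emit every number under the same (empty) key so that all values end up in one reduce call.
        context.write(new Text(), new IntWritable(Integer.parseInt(value.toString().trim())));
    }

}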

The Mapper function is very simple. It reads the data from HDFS, turns it into <key, value> pairs, and passes them on to the Reducer function. The Reducer function is as follows:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Scan all values that share this key and keep the largest one.
        int temp = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            if (value.get() > temp) {
                temp = value.get();
            }
        }
        context.write(new Text(), new IntWritable(temp));
    }
}

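The Reducer function is also simple: it finds the maximum among the values sent from the Mapper side. Between the Mapper function and the Reducer function sits a Combiner, whose code looks like this: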

public class MyCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

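As the declaration shows, the Combiner also extends the Reducer class, and it is written the same way as the reduce function. The Reducer and the Combiner do the same thing from the outside; only the place where they run and their Context differ. Once your own Combiner is defined, one line has to be added to the Job class to tell the Job to use the Combiner on the Mapper side: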

job.setCombinerClass(MyCombiner.class);
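What the Combiner changes for this job can be imitated locally without any Hadoop classes. The sketch below is only an illustration under stated assumptions: the class name MaxCombinerSketch, the method names, and the division of the sample input into two splits are hypothetical, and a real Hadoop job decides on its own whether and how often to run the Combiner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MaxCombinerSketch {

    // Same logic as the reduce function above: keep the largest value seen.
    static int max(List<Integer> values) {
        int best = Integer.MIN_VALUE;
        for (int v : values) {
            if (v > best) {
                best = v;
            }
        }
        return best;
    }

    // Applies the "combiner" once per map-side split and returns the values
    // that would still have to travel to the reducer.
    static List<Integer> combine(List<List<Integer>> splits) {
        List<Integer> shuffled = new ArrayList<>();
        for (List<Integer> split : splits) {
            shuffled.add(max(split));
        }
        return shuffled;
    }

    public static void main(String[] args) {
        // The nine sample numbers, divided into two hypothetical splits.
        List<List<Integer>> splits = Arrays.asList(
                Arrays.asList(12, 5, 9, 21, 43),
                Arrays.asList(99, 65, 32, 10));
        List<Integer> shuffled = combine(splits);
        System.out.println("values sent to the reducer: " + shuffled.size()); // 2 instead of 9
        System.out.println("maximum: " + max(shuffled)); // 99
    }
}
```

Because the maximum of the partial maxima equals the maximum of all values, the final result is the same whether the combine step runs or not, which is exactly the property a Combiner needs.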

The Job class of the maximum-number example then looks like this:

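import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyMaxNum {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        // On Hadoop 2 and later this constructor is deprecated in favor of Job.getInstance(conf, "My Max Num").
        Job job = new Job(conf, "My Max Num");
        job.setJarByClass(MyMaxNum.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Run MyCombiner on the map side before the shuffle.
        job.setCombinerClass(MyCombiner.class);
        FileInputFormat.addInputPath(job, new Path("/huhui/nums.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}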

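You can also compress the output. Adding two lines to the main function is enough to compress the output of the Reducer function. They have to be set on conf before the Job object is created, because the Job works on its own copy of the Configuration. Compression is not needed for a result this small; it is shown here only as a point of knowledge.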

// Compress the job output with gzip (set on conf before creating the Job)
conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);

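Example 2: A Custom Key Type

This example shows how to define a custom type for the key of a <key, value> pair, and how to use Hadoop's comparator class WritableComparator and the input format KeyValueTextInputFormat.

The requirement is as follows. Given this input: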

str1 2
str2 5
str3 9
str1 1
str2 3
str3 12
str1 8
str2 7
str3 18

the expected output is:

str1 1,2,8
str2 3,5,7
str3 9,12,18

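Note that, by default, the input format KeyValueTextInputFormat splits each line into key and value at the first tab character (\t). Input whose key and value are separated by a comma is not split this way unless the separator is reconfigured.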
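The splitting rule can be pictured with a small plain-Java sketch. It is not Hadoop's implementation: the class name KeyValueSplitSketch and the method split are hypothetical, and the sketch only imitates the documented behavior that a line without the separator becomes a key with an empty value.

```java
class KeyValueSplitSketch {

    // Splits a line at the first occurrence of the separator and returns
    // {key, value}. When the separator is absent, the whole line becomes
    // the key and the value is empty.
    static String[] split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] tabbed = split("str1\t2", '\t');
        System.out.println("[" + tabbed[0] + "] -> [" + tabbed[1] + "]");

        // A comma-separated line is not split when the separator is a tab.
        String[] comma = split("str1,2", '\t');
        System.out.println("[" + comma[0] + "] -> [" + comma[1] + "]");
    }
}
```

With comma-separated input, the Mapper of this example would receive an empty value, and parsing it as an integer would fail.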

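To meet this requirement we need a custom data type for the key. In Hadoop, every custom key type has to implement the WritableComparable interface and override its three methods (readFields, write, and compareTo). Here we define the class IntPaire, which implements WritableComparable: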

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class IntPaire implements WritableComparable<IntPaire> {

    private String firstKey;
    private int secondKey;

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialize the fields in the same order in which write() emits them.
        firstKey = in.readUTF();
        secondKey = in.readInt();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialize the string part first, then the integer part.
        out.writeUTF(firstKey);
        out.writeInt(secondKey);
    }

    @Override
    public int compareTo(IntPaire o) {
        // Compares by firstKey only, in descending order. The job below installs
        // its own sort comparator, which takes precedence over this method.
        return o.getFirstKey().compareTo(this.firstKey);
    }

    public String getFirstKey() {
        return firstKey;
    }

    public void setFirstKey(String firstKey) {
        this.firstKey = firstKey;
    }

    public int getSecondKey() {
        return secondKey;
    }

    public void setSecondKey(int secondKey) {
        this.secondKey = secondKey;
    }
}

The overridden readFields and write methods above are always written this way; they are practically a template.
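The template works because readFields reads the fields back in exactly the order in which write emitted them. The following plain-Java sketch shows the same round trip with the standard java.io data streams. The class name WritableRoundTripSketch and its methods are hypothetical helpers for illustration and are not part of Hadoop or of the original program.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableRoundTripSketch {

    // Serializes the two fields in the same order as IntPaire.write.
    static byte[] write(String firstKey, int secondKey) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(firstKey);
            out.writeInt(secondKey);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the fields back in that same order, as IntPaire.readFields does,
    // and returns them joined by a tab for easy printing.
    static String read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String firstKey = in.readUTF();
            int secondKey = in.readInt();
            return firstKey + "\t" + secondKey;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = write("str1", 5);
        System.out.println("serialized bytes: " + data.length);
        System.out.println("read back: " + read(data));
    }
}
```

If readFields read the integer before the string, the bytes would be interpreted incorrectly, which is why the two methods are always kept in step.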

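Because key/value pairs with the same key have to be sent to the same Reducer, a Partitioner is needed. In Hadoop, the Partitioner decides which Reducer each key is assigned to. It is an abstract class with a single abstract method, which must be overridden when the class is extended: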

getPartition(KEY key, VALUE value, int numPartitions)

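The first parameter, key, and the second parameter, value, are the <key, value> pair output by the Mapper. The third parameter, numPartitions, is the total number of Reducers (reduce tasks) of the job. The return value is the number of the assigned Reducer, that is, the Reducer to which the Mapper's output key is sent. A Partitioner is usually implemented by hashing: it takes the hash value of the key modulo the number of Reducers to obtain the Reducer number. This guarantees that identical keys are always assigned to the same Reducer. With N Reducers, the numbers are 0, 1, 2, 3, ..., N-1.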
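The hashing rule described above fits in a single expression. The sketch below shows it in plain Java, without Hadoop. The class name HashPartitionSketch and the method partitionFor are hypothetical names chosen for this illustration.

```java
class HashPartitionSketch {

    // Maps a key to a partition number in the range 0..numPartitions-1.
    // Masking with Integer.MAX_VALUE clears the sign bit, so a negative
    // hashCode can never produce a negative partition number.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int reducers = 3; // assumed number of reduce tasks
        for (String key : new String[] { "str1", "str2", "str3", "str1" }) {
            System.out.println(key + " -> reducer " + partitionFor(key, reducers));
        }
    }
}
```

Since the result depends only on the key and the number of partitions, every occurrence of the same key lands on the same Reducer, no matter which Mapper produced it.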

In this example, the Partitioner is implemented like this:

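import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class PartitionByText extends Partitioner<IntPaire, IntWritable> {

    @Override
    public int getPartition(IntPaire key, IntWritable value, int numPartitions) { // numPartitions: number of reducers
        // Partition on firstKey only; the mask keeps the hash non-negative.
        return (key.getFirstKey().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}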

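This example also uses Hadoop's comparator class WritableComparator, which implements the RawComparator interface. The comparator below orders the keys by firstKey and then by secondKey: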

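import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class TextIntComparator extends WritableComparator {

    public TextIntComparator() {
        // true: let the parent class create IntPaire instances for deserialization.
        super(IntPaire.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        IntPaire o1 = (IntPaire) a;
        IntPaire o2 = (IntPaire) b;
        if (!o1.getFirstKey().equals(o2.getFirstKey())) {
            return o1.getFirstKey().compareTo(o2.getFirstKey());
        } else {
            // Integer.compare avoids the overflow that plain subtraction can cause.
            return Integer.compare(o1.getSecondKey(), o2.getSecondKey());
        }
    }

}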

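Because an extra field was added to the key, grouping has to be configured by hand. This is easy, since Job provides a method for it. Here the grouping comparator is implemented as follows: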

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class TextComparator extends WritableComparator {

    public TextComparator() {
        super(IntPaire.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Keys with the same firstKey belong to the same reduce group; secondKey is ignored.
        IntPaire o1 = (IntPaire) a;
        IntPaire o2 = (IntPaire) b;
        return o1.getFirstKey().compareTo(o2.getFirstKey());
    }

}
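How the two comparators work together can be tried out locally. The sketch below imitates the sort step (by string part, then by integer part) and the grouping step (by string part only) with plain Java collections. It is only an illustration: the names SecondarySortSketch, Pair, and sortAndGroup are hypothetical, and the real framework performs these steps during the shuffle rather than in one in-memory list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SecondarySortSketch {

    // Stand-in for the composite key: a string part and an integer part.
    static final class Pair {
        final String first;
        final int second;

        Pair(String first, int second) {
            this.first = first;
            this.second = second;
        }
    }

    static Map<String, String> sortAndGroup(List<Pair> records) {
        List<Pair> sorted = new ArrayList<>(records);
        // Sort step: order by the string part, then by the integer part.
        sorted.sort(Comparator.comparing((Pair p) -> p.first)
                .thenComparingInt(p -> p.second));

        // Grouping step: records with the same string part form one group.
        // The values arrive already sorted, so they are simply joined in order.
        Map<String, String> grouped = new LinkedHashMap<>();
        for (Pair p : sorted) {
            grouped.merge(p.first, String.valueOf(p.second), (a, b) -> a + "," + b);
        }
        return grouped;
    }

    public static void main(String[] args) {
        String[] keys = { "str1", "str2", "str3", "str1", "str2", "str3", "str1", "str2", "str3" };
        int[] values = { 2, 5, 9, 1, 3, 12, 8, 7, 18 };
        List<Pair> records = new ArrayList<>();
        for (int i = 0; i < keys.length; i++) {
            records.add(new Pair(keys[i], values[i]));
        }
        for (Map.Entry<String, String> e : sortAndGroup(records).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Putting the integer into the key is what lets the framework sort the values. The grouping step then hides that extra field again, so each string still gets a single reduce call.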

Next comes the Mapper function. It reads the data from HDFS through the KeyValueTextInputFormat input format; the input format itself is set in the job.

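import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortMapper extends Mapper<Object, Text, IntPaire, IntWritable> {

    // Reused across map() calls to avoid allocating new objects per record.
    private final IntPaire intPaire = new IntPaire();
    private final IntWritable intWritable = new IntWritable(0);

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // key is the text before the tab, value is the number after it.
        int intValue = Integer.parseInt(value.toString().trim());
        intPaire.setFirstKey(key.toString());
        intPaire.setSecondKey(intValue);
        intWritable.set(intValue);
        context.write(intPaire, intWritable); // e.g. key: (str1, 5), value: 5
    }
}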

The Reducer function follows:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SortReducer extends Reducer<IntPaire, IntWritable, Text, Text> {

    @Override
    protected void reduce(IntPaire key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Join the already sorted values of this group with commas.
        StringBuilder combineValue = new StringBuilder();
        for (IntWritable value : values) {
            if (combineValue.length() > 0) {
                combineValue.append(',');
            }
            combineValue.append(value.get());
        }
        context.write(new Text(key.getFirstKey()), new Text(combineValue.toString()));
    }

}

The Job class looks like this:

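import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortJob {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Sortint");
        job.setJarByClass(SortJob.class);
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);

        // Set the input format
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        // Set the output types of the map phase
        job.setMapOutputKeyClass(IntPaire.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Set the sort comparator
        job.setSortComparatorClass(TextIntComparator.class);

        // Set the grouping comparator
        job.setGroupingComparatorClass(TextComparator.class); // group by firstKey

        job.setPartitionerClass(PartitionByText.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("/huhui/input/words.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}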

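With that, the program is complete and does what the requirement asks for.

Afterword

When you first meet MapReduce programs you may not know where to start, probably because you have not yet understood the mechanism and principles of MapReduce. Writing a few simple MapReduce functions yourself helps with understanding; from there you can go deeper step by step.

For more Hadoop information, see the Hadoop topic page http://www.linuxidc.com/topicnews.aspx?tid=13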
