I'm using Linux Mint 15 with hadoop-1.2.1.
First, configure Hadoop in pseudo-distributed mode; for the details, see: configuring Hadoop in pseudo-distributed mode.
Suppose there is a file num.txt with the following contents:
123
1
23
231
333
001
234
543
1111
Each line is a single number, and the goal is to find the maximum value.
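Before writing the MapReduce job, it helps to know the expected answer. Here is a small plain-Java sketch (not part of the Hadoop code; it assumes num.txt is in the current directory) that scans the file locally and prints the maximum, which should be 1111 for the sample above:

LocalMax.java (for local verification only):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LocalMax {
    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader("num.txt"));
        long max = Long.MIN_VALUE;
        String line;
        // Scan the file line by line, skip blank lines, and keep the largest value
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (!line.isEmpty()) {
                max = Math.max(max, Long.parseLong(line));
            }
        }
        reader.close();
        System.out.println(max);   // 1111 for the sample file above
    }
}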
Create a project named maxnum-hadoop in Eclipse and add the following external jar to its build path:
~/hadoop-1.2.1/hadoop-core-1.2.1.jar
Then create and edit the following three files in the project:
MaxNumMapper.java:
import java.io.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class MaxNumMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, LongWritable> {

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<LongWritable, LongWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString().trim();
        if (line.length() != 0) {    // skip blank lines before parsing
            long num = Long.parseLong(line);
            // Emit every number under the same key (1), so that the reducer
            // sees all of them in a single reduce call.
            output.collect(new LongWritable(1), new LongWritable(num));
        }
    }
}
MaxNumReducer.java:
import java.io.*;
import java.util.Iterator;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class MaxNumReducer extends MapReduceBase
        implements Reducer<LongWritable, LongWritable, LongWritable, LongWritable> {

    @Override
    public void reduce(LongWritable key, Iterator<LongWritable> values,
            OutputCollector<LongWritable, LongWritable> output, Reporter reporter)
            throws IOException {
        long maxNum = Long.MIN_VALUE;
        while (values.hasNext()) {
            maxNum = Math.max(maxNum, values.next().get());
        }
        output.collect(key, new LongWritable(maxNum));
    }
}
MaxNum.java:
/*
 * Takes two arguments: the input text file and the name of the output directory.
 */
import java.io.*;

import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class MaxNum {
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: MaxNum <input text file> <output directory>");
            System.exit(-1);    // exit instead of running with bad arguments
        }

        JobConf conf = new JobConf(MaxNum.class);
        conf.setJobName("get max number");

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(MaxNumMapper.class);
        conf.setReducerClass(MaxNumReducer.class);
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(LongWritable.class);

        JobClient.runJob(conf);
    }
}
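Because taking a maximum is associative and MaxNumReducer's input and output types are identical, the same class could also be registered as a combiner, so that each map task pre-aggregates its own local maximum before the shuffle. This is optional and not part of the job above; to try it, add one line to main() before JobClient.runJob(conf):

conf.setCombinerClass(MaxNumReducer.class);  // optional: compute per-map maxima before the shuffle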
Once the code is ready, right-click the project, open Export, and export the project as a jar file named maxnum.jar.
Now let's test the code.
Run start-dfs.sh and start-mapred.sh to start HDFS and MapReduce, then put num.txt into HDFS:
$ hadoop fs -put num.txt .
$ hadoop fs -ls /
drwxr-xr-x - letian supergroup 0 2013-10-22 22:07 /test
drwxr-xr-x - letian supergroup 0 2013-10-22 21:58 /tmp
drwxr-xr-x - letian supergroup 0 2013-10-23 14:58 /user
$ hadoop fs -ls /user
drwxr-xr-x - letian supergroup 0 2013-10-23 14:58 /user/letian
$ hadoop fs -ls /user/letian
Found 1 items
-rw-r--r-- 3 letian supergroup 34 2013-10-23 14:58 /user/letian/num.txt
As you can see, I am logged into Linux as user letian, and "hadoop fs -put num.txt ." uses a relative HDFS path, so num.txt ended up at /user/letian/num.txt.
Run our MapReduce program:
$ hadoop jar maxnum.jar MaxNum num.txt result.txt
Check the result:
$ hadoop fs -ls /user/letian
Found 2 items
-rw-r--r-- 3 letian supergroup 34 2013-10-23 14:58 /user/letian/num.txt
drwxr-xr-x - letian supergroup 0 2013-10-23 14:59 /user/letian/result.txt
result.txt is actually a directory (giving it a .txt suffix was my naming mistake): the job's second argument is an output directory, which holds one part-NNNNN file per reducer along with _SUCCESS and _logs.
Next, copy result.txt to the local file system and look at the result:
$ hadoop fs -copyToLocal /user/letian/result.txt result.txt
$ ls result.txt/
_logs/ part-00000 _SUCCESS
$ less result.txt/part-00000
$ cat result.txt/part-00000
1 1111
The output is the key (1) and the maximum value (1111), separated by a tab. nice~
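As an aside, instead of copying the output to the local disk first, the result can also be read straight from HDFS with the FileSystem API. Below is a minimal sketch (not part of the original steps; the part file path assumes the output directory above, and the class is easiest to run through the hadoop command so the cluster configuration is on the classpath):

PrintResult.java:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrintResult {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Reducer output file inside the result.txt directory (adjust if yours differs)
        Path part = new Path("/user/letian/result.txt/part-00000");
        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(part)));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
        fs.close();
    }
}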