Example: Finding the Maximum Temperature for Each Year
The HDFS root directory contains a file, /input.txt, with the following content:
```
2014010114
2014010216
2014010317
2014010410
2014010506
2012010609
2012010732
2012010812
2012010919
2012011023
2001010116
2001010212
2001010310
2001010411
2001010529
2013010619
2013010722
2013010812
2013010929
2013011023
2008010105
2008010216
2008010337
2008010414
2008010516
2007010619
2007010712
2007010812
2007010999
2007011023
2010010114
2010010216
2010010317
2010010410
2010010506
2015010649
2015010722
2015010812
2015010999
2015011023
```
A record such as 2010012325 means that the temperature on January 23, 2010 was 25 degrees. The task is to use MapReduce to compute the maximum temperature that occurred in each year.
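Before turning to the MapReduce job itself, here is a minimal plain-Java sketch of how one record can be decoded by fixed character positions (the first four characters are the year, everything from index 8 on is the temperature); the Mapper shown later uses exactly the same substring logic:

```java
public class RecordParseDemo {
    public static void main(String[] args) {
        String record = "2010012325";                     // yyyyMMdd followed by the temperature
        String year = record.substring(0, 4);             // "2010"
        int temp = Integer.parseInt(record.substring(8)); // 25
        System.out.println(year + " -> " + temp);         // prints: 2010 -> 25
    }
}
```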
Before writing any code, make sure the relevant jar packages are imported correctly. I use Maven, and you can search for these artifactIds at http://mvnrepository.com.
This program takes a Hadoop file as input and writes a Hadoop file as output, so it needs the file system, which means the hadoop-hdfs package must be included. We also need to submit a job to the MapReduce cluster, which requires the MapReduce client, so the hadoop-mapreduce-client-jobclient package must be imported. In addition, processing the data uses Hadoop data types such as IntWritable and Text, so the hadoop-common package is needed as well. The dependencies required to run this program are therefore the following:
```xml
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.4.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
    <version>2.4.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.4.0</version>
</dependency>
```
With the packages in place, the code is designed as follows:
```java
package com.abc.yarn;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Temperature {
    /**
     * The four generic type parameters are:
     * KeyIn    the key of the Mapper's input, here the starting position of each line (0, 11, ...)
     * ValueIn  the value of the Mapper's input, here the text of each line
     * KeyOut   the key of the Mapper's output, here the "year" extracted from each line
     * ValueOut the value of the Mapper's output, here the "temperature" extracted from each line
     */
    static class TempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Sample output: Before Mapper: 0, 2000010115
            System.out.print("Before Mapper: " + key + ", " + value);
            String line = value.toString();
            String year = line.substring(0, 4);
            int temperature = Integer.parseInt(line.substring(8));
            context.write(new Text(year), new IntWritable(temperature));
            // Sample output: After Mapper:2000, 15
            System.out.println("======" + "After Mapper:" + new Text(year) + ", " + new IntWritable(temperature));
        }
    }

    /**
     * The four generic type parameters are:
     * KeyIn    the key of the Reducer's input, here the "year" from each line
     * ValueIn  the value of the Reducer's input, here the "temperature" from each line
     * KeyOut   the key of the Reducer's output, here the distinct "year"
     * ValueOut the value of the Reducer's output, here the maximum temperature of that year
     */
    static class TempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxValue = Integer.MIN_VALUE;
            StringBuffer sb = new StringBuffer();
            // Take the maximum of the values
            for (IntWritable value : values) {
                maxValue = Math.max(maxValue, value.get());
                sb.append(value).append(", ");
            }
            // Sample output: Before Reduce: 2000, 15, 23, 99, 12, 22,
            System.out.print("Before Reduce: " + key + ", " + sb.toString());
            context.write(key, new IntWritable(maxValue));
            // Sample output: After Reduce: 2000, 99
            System.out.println("======" + "After Reduce: " + key + ", " + maxValue);
        }
    }

    public static void main(String[] args) throws Exception {
        // Input path
        String dst = "hdfs://localhost:9000/input.txt";
        // Output path; it must not exist yet (not even as an empty directory)
        String dstOut = "hdfs://localhost:9000/output";
        Configuration hadoopConfig = new Configuration();
        hadoopConfig.set("fs.hdfs.impl",
                org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        hadoopConfig.set("fs.file.impl",
                org.apache.hadoop.fs.LocalFileSystem.class.getName());
        Job job = new Job(hadoopConfig);

        // If the program is packaged as a jar and run that way, the following line is needed
        //job.setJarByClass(NewMaxTemperature.class);

        // Input and output paths used when the job runs
        FileInputFormat.addInputPath(job, new Path(dst));
        FileOutputFormat.setOutputPath(job, new Path(dstOut));

        // Use the custom Mapper and Reducer as the processing classes of the two phases
        job.setMapperClass(TempMapper.class);
        job.setReducerClass(TempReducer.class);

        // Set the types of the final output key and value
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Run the job and wait for it to finish
        job.waitForCompletion(true);
        System.out.println("Finished");
    }
}
```
In the code above, note that the generic type parameters of the Mapper class are not Java primitive types but Hadoop data types such as Text and IntWritable. You can roughly regard them as equivalents of the Java types String and int.
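As a small illustration of that correspondence (a standalone sketch, not part of the job), the Hadoop wrapper types can be converted to and from plain Java values with Text.toString() and IntWritable.get():

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        // Hadoop's serializable wrappers around String and int
        Text year = new Text("2014");
        IntWritable temperature = new IntWritable(14);

        // Unwrap back into plain Java types
        String y = year.toString(); // "2014"
        int t = temperature.get();  // 14
        System.out.println(y + ", " + t);
    }
}
```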
The generic parameters of the Mapper class are, in order, <k1, v1, k2, v2>. The second parameter of the map method is the text content of the line, which is what we care about. The core code extracts the "year" and the "temperature" from each line of text (via substring), uses the year as the new key and the temperature as the new value, and writes them to the context. Because each year has several lines of data, every line produces one <year, temperature> key-value pair.
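Since every line emits its own <year, temperature> pair and taking a maximum is associative and commutative, the same reducer class could in principle also be registered as a combiner so that map output is pre-aggregated before the shuffle. This is an optional optimization, not part of the original program; it would be a single extra line in main(), after job.setReducerClass(...):

```java
// Optional (not in the original job setup): reuse TempReducer as a combiner.
// Safe here because max() gives the same result whether it is applied once per
// year or first per map task and then again in the reducer.
job.setCombinerClass(TempReducer.class);
```

Note that with the debug prints left in TempReducer, the combiner would also emit its own Before/After Reduce lines in the map task logs.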
The console output is shown below:
```
Before Mapper: 0, 2014010114======After Mapper:2014, 14
Before Mapper: 11, 2014010216======After Mapper:2014, 16
Before Mapper: 22, 2014010317======After Mapper:2014, 17
Before Mapper: 33, 2014010410======After Mapper:2014, 10
Before Mapper: 44, 2014010506======After Mapper:2014, 6
Before Mapper: 55, 2012010609======After Mapper:2012, 9
Before Mapper: 66, 2012010732======After Mapper:2012, 32
Before Mapper: 77, 2012010812======After Mapper:2012, 12
Before Mapper: 88, 2012010919======After Mapper:2012, 19
Before Mapper: 99, 2012011023======After Mapper:2012, 23
Before Mapper: 110, 2001010116======After Mapper:2001, 16
Before Mapper: 121, 2001010212======After Mapper:2001, 12
Before Mapper: 132, 2001010310======After Mapper:2001, 10
Before Mapper: 143, 2001010411======After Mapper:2001, 11
Before Mapper: 154, 2001010529======After Mapper:2001, 29
Before Mapper: 165, 2013010619======After Mapper:2013, 19
Before Mapper: 176, 2013010722======After Mapper:2013, 22
Before Mapper: 187, 2013010812======After Mapper:2013, 12
Before Mapper: 198, 2013010929======After Mapper:2013, 29
Before Mapper: 209, 2013011023======After Mapper:2013, 23
Before Mapper: 220, 2008010105======After Mapper:2008, 5
Before Mapper: 231, 2008010216======After Mapper:2008, 16
Before Mapper: 242, 2008010337======After Mapper:2008, 37
Before Mapper: 253, 2008010414======After Mapper:2008, 14
Before Mapper: 264, 2008010516======After Mapper:2008, 16
Before Mapper: 275, 2007010619======After Mapper:2007, 19
Before Mapper: 286, 2007010712======After Mapper:2007, 12
Before Mapper: 297, 2007010812======After Mapper:2007, 12
Before Mapper: 308, 2007010999======After Mapper:2007, 99
Before Mapper: 319, 2007011023======After Mapper:2007, 23
Before Mapper: 330, 2010010114======After Mapper:2010, 14
Before Mapper: 341, 2010010216======After Mapper:2010, 16
Before Mapper: 352, 2010010317======After Mapper:2010, 17
Before Mapper: 363, 2010010410======After Mapper:2010, 10
Before Mapper: 374, 2010010506======After Mapper:2010, 6
Before Mapper: 385, 2015010649======After Mapper:2015, 49
Before Mapper: 396, 2015010722======After Mapper:2015, 22
Before Mapper: 407, 2015010812======After Mapper:2015, 12
Before Mapper: 418, 2015010999======After Mapper:2015, 99
Before Mapper: 429, 2015011023======After Mapper:2015, 23
Before Reduce: 2001, 12, 10, 11, 29, 16, ======After Reduce: 2001, 29
Before Reduce: 2007, 23, 19, 12, 12, 99, ======After Reduce: 2007, 99
Before Reduce: 2008, 16, 14, 37, 16, 5, ======After Reduce: 2008, 37
Before Reduce: 2010, 10, 6, 14, 16, 17, ======After Reduce: 2010, 17
Before Reduce: 2012, 19, 12, 32, 9, 23, ======After Reduce: 2012, 32
Before Reduce: 2013, 23, 29, 12, 22, 19, ======After Reduce: 2013, 29
Before Reduce: 2014, 14, 6, 10, 17, 16, ======After Reduce: 2014, 17
Before Reduce: 2015, 23, 49, 22, 12, 99, ======After Reduce: 2015, 99
Finished
```
Execution result:
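The job writes its result into the /output directory in HDFS; each line of the result file is a year and its maximum temperature separated by a tab (the default TextOutputFormat separator), matching the After Reduce lines above. As a sketch of how one could inspect it programmatically, assuming Hadoop's default reducer output file name part-r-00000:

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same HDFS address as in the job; the file name part-r-00000 is
        // Hadoop's default for the first reducer and may differ in other setups.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        try (InputStream in = fs.open(new Path("/output/part-r-00000"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```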
Verifying the analysis
From the printed logs we can see the following:
- The Mapper's input data (k1, v1) has the default line-oriented key-value format: <0, 2010012325>, <11, 2012010123>...
- The Reducer's input data has the format obtained by merging identical keys: <2001, [12, 32, 25...]>, <2007, [20, 34, 30...]>...
- The Reducer's output data (k3, v3) has whatever format we write out in the Reducer: <2001, 32>, <2007, 34>...
Because the input data is so small, stage 1 of the Map process (splitting the input) cannot be demonstrated here, but that is in fact what happens.
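If you did want to see several input splits even on a tiny file, one option (a sketch, not part of the original program, assuming stage 1 refers to splitting the input file into input splits) is to cap the split size with one extra line in main():

```java
// Force very small input splits (at most 128 bytes each, an arbitrary value)
// so that even a tiny file is divided into several splits, each handled by
// its own map task.
FileInputFormat.setMaxInputSplitSize(job, 128);
```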
The first point above verifies stage 2 of the Map process: the key is the starting position of each line (in bytes), and the value is the text content of that line. Each record here is 10 characters plus a newline, which is why the keys advance by 11: 0, 11, 22, and so on.
In addition, the following Reduce lines:
```
Before Reduce: 2001, 12, 10, 11, 29, 16, ======After Reduce: 2001, 29
Before Reduce: 2007, 23, 19, 12, 12, 99, ======After Reduce: 2007, 99
Before Reduce: 2008, 16, 14, 37, 16, 5, ======After Reduce: 2008, 37
Before Reduce: 2010, 10, 6, 14, 16, 17, ======After Reduce: 2010, 17
Before Reduce: 2012, 19, 12, 32, 9, 23, ======After Reduce: 2012, 32
Before Reduce: 2013, 23, 29, 12, 22, 19, ======After Reduce: 2013, 29
Before Reduce: 2014, 14, 6, 10, 17, 16, ======After Reduce: 2014, 17
Before Reduce: 2015, 23, 49, 22, 12, 99, ======After Reduce: 2015, 99
```
confirm stage 4 of the Map process: the data is partitioned first, and Reduce is then executed once for each partition (stage 6 of the Map process).
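By default this partitioning is done by Hadoop's HashPartitioner, which assigns each key to a reduce partition based on its hash. If a different distribution were ever needed, a custom partitioner could be supplied; the sketch below (purely illustrative, not needed for this example) essentially reproduces the default behaviour for this job's Text/IntWritable pairs:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Example only: route each year to a reduce partition by hashing it,
// which is essentially what the default HashPartitioner already does.
public class YearPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

It would be registered with job.setPartitionerClass(YearPartitioner.class).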
As for the Mapper's output, it was mentioned earlier that if there is no Reduce phase, the Mapper's output is written directly to the output file. So let us remove the Reduce step (simply comment out the line job.setReducerClass(TempReducer.class); in main()).
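Incidentally, Hadoop also provides an explicit way to run a map-only job: setting the number of reduce tasks to zero, in which case each map task writes its output straight to a part-m-NNNNN file. This is an alternative to commenting out the reducer, not what is done in this post; a one-line sketch:

```java
// Alternative map-only setup (not used here): disable the reduce phase entirely,
// so map output goes directly to files named part-m-00000, part-m-00001, ...
job.setNumReduceTasks(0);
```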
Run it again; the console output is shown below:
```
Before Mapper: 0, 2014010114======After Mapper:2014, 14
Before Mapper: 11, 2014010216======After Mapper:2014, 16
Before Mapper: 22, 2014010317======After Mapper:2014, 17
Before Mapper: 33, 2014010410======After Mapper:2014, 10
Before Mapper: 44, 2014010506======After Mapper:2014, 6
Before Mapper: 55, 2012010609======After Mapper:2012, 9
Before Mapper: 66, 2012010732======After Mapper:2012, 32
Before Mapper: 77, 2012010812======After Mapper:2012, 12
Before Mapper: 88, 2012010919======After Mapper:2012, 19
Before Mapper: 99, 2012011023======After Mapper:2012, 23
Before Mapper: 110, 2001010116======After Mapper:2001, 16
Before Mapper: 121, 2001010212======After Mapper:2001, 12
Before Mapper: 132, 2001010310======After Mapper:2001, 10
Before Mapper: 143, 2001010411======After Mapper:2001, 11
Before Mapper: 154, 2001010529======After Mapper:2001, 29
Before Mapper: 165, 2013010619======After Mapper:2013, 19
Before Mapper: 176, 2013010722======After Mapper:2013, 22
Before Mapper: 187, 2013010812======After Mapper:2013, 12
Before Mapper: 198, 2013010929======After Mapper:2013, 29
Before Mapper: 209, 2013011023======After Mapper:2013, 23
Before Mapper: 220, 2008010105======After Mapper:2008, 5
Before Mapper: 231, 2008010216======After Mapper:2008, 16
Before Mapper: 242, 2008010337======After Mapper:2008, 37
Before Mapper: 253, 2008010414======After Mapper:2008, 14
Before Mapper: 264, 2008010516======After Mapper:2008, 16
Before Mapper: 275, 2007010619======After Mapper:2007, 19
Before Mapper: 286, 2007010712======After Mapper:2007, 12
Before Mapper: 297, 2007010812======After Mapper:2007, 12
Before Mapper: 308, 2007010999======After Mapper:2007, 99
Before Mapper: 319, 2007011023======After Mapper:2007, 23
Before Mapper: 330, 2010010114======After Mapper:2010, 14
Before Mapper: 341, 2010010216======After Mapper:2010, 16
Before Mapper: 352, 2010010317======After Mapper:2010, 17
Before Mapper: 363, 2010010410======After Mapper:2010, 10
Before Mapper: 374, 2010010506======After Mapper:2010, 6
Before Mapper: 385, 2015010649======After Mapper:2015, 49
Before Mapper: 396, 2015010722======After Mapper:2015, 22
Before Mapper: 407, 2015010812======After Mapper:2015, 12
Before Mapper: 418, 2015010999======After Mapper:2015, 99
Before Mapper: 429, 2015011023======After Mapper:2015, 23
Finished
```
Now let us look at the execution result again:
The result has many more lines than are shown here.
Since no Reduce operation was executed, this is the content of the intermediate file written by the Mapper.
From the printed logs we can see the following:
- The Mapper's output data (k2, v2) has whatever format we write out in the Mapper: <2010, 25>, <2012, 23>...
This result shows that every line of the original data file indeed produces one line of output, which confirms stage 3 of the Map process.
It also shows that the "year" values are no longer in the order in which they were fed to the Mapper, which demonstrates that the Map process also performs a sort by key, i.e. stage 5 of the Map process.
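The order used by that sort comes from the key type's comparator: Text keys compare lexicographically, which for four-digit years coincides with numeric order. If a different ordering were ever wanted, a custom comparator could be plugged in; a minimal sketch (purely illustrative, not needed for this example):

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Example only: sort Text keys in descending instead of ascending order.
public class DescendingTextComparator extends WritableComparator {
    public DescendingTextComparator() {
        super(Text.class, true); // true: create key instances so the object-based compare() is used
    }

    @Override
    @SuppressWarnings({"rawtypes", "unchecked"})
    public int compare(WritableComparable a, WritableComparable b) {
        return b.compareTo(a); // reverse of the natural (ascending) order
    }
}
```

It would be registered with job.setSortComparatorClass(DescendingTextComparator.class).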