This article was first published on my blog.
Continuing with the practice exercises today. Last time I worked through partitioning, and following the partition → sort → group → reduce sequence, today's exercise should be about sorting. Let's get started!
Speaking of sorting, look at how the LongWritable type used in Hadoop's WordCount example is defined: it implements the interface WritableComparable, as follows:
```java
public interface WritableComparable<T> extends Writable, Comparable<T> {
}

public interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}
```
The Writable interface defines write() and readFields(), which write the object to and read it back from a data stream, while Comparable contributes the compareTo() method for ordering. Since Hadoop's built-in types can be compared, how does the framework actually sort when one of them is used as the map output key? To answer that, look at the source of the map task class MapTask: its inner MapOutputBuffer class has a sorter field right under the "spill accounting" comment:
```java
private final IndexedSorter sorter;
```
This field is initialized by:
```java
sorter = ReflectionUtils.newInstance(job.getClass("map.sort.class",
        QuickSort.class, IndexedSorter.class), job);
```
So the sort algorithm can be specified in the configuration file, but it defaults to QuickSort. QuickSort has a couple of important methods:
```java
public void sort(final IndexedSortable s, int p, int r,
        final Progressable rep);

private static void sortInternal(final IndexedSortable s, int p, int r,
        final Progressable rep, int depth);
```
The IndexedSortable argument passed in is the current MapOutputBuffer itself, since MapOutputBuffer also implements IndexedSortable. QuickSort's sort() therefore compares elements through MapOutputBuffer's compare() method, shown below:
```java
public int compare(int i, int j) {
    final int ii = kvoffsets[i % kvoffsets.length];
    final int ij = kvoffsets[j % kvoffsets.length];
    // sort by partition
    if (kvindices[ii + PARTITION] != kvindices[ij + PARTITION]) {
        return kvindices[ii + PARTITION] - kvindices[ij + PARTITION];
    }
    // sort by key
    return comparator.compare(kvbuffer,
            kvindices[ii + KEYSTART],
            kvindices[ii + VALSTART] - kvindices[ii + KEYSTART],
            kvbuffer,
            kvindices[ij + KEYSTART],
            kvindices[ij + VALSTART] - kvindices[ij + KEYSTART]);
}
```
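The two-level rule in compare() above — order by partition first, then by key — can be sketched with a tiny standalone example. The `records` layout here is a simplified, hypothetical stand-in for the kvindices bookkeeping, not Hadoop's actual buffer code:

```java
import java.util.Arrays;
import java.util.Comparator;

public class PartitionThenKeySort {
    public static void main(String[] args) {
        // Each record is {partition, key} -- a toy stand-in for the
        // entries MapOutputBuffer.compare() inspects.
        int[][] records = { {1, 30}, {0, 20}, {1, 10}, {0, 5} };

        // Same ordering rule as compare(): partition first, then key.
        Arrays.sort(records, Comparator
                .comparingInt((int[] r) -> r[0])
                .thenComparingInt(r -> r[1]));

        System.out.println(Arrays.deepToString(records));
        // → [[0, 5], [0, 20], [1, 10], [1, 30]]
    }
}
```

All records for partition 0 end up contiguous and key-sorted, then all records for partition 1, which is exactly what lets each reducer receive its partition already in key order.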
The comparator used here is determined by the configuration key "mapred.output.key.comparator.class" by default, as the source shows:
```java
public RawComparator getOutputKeyComparator() {
    Class<? extends RawComparator> theClass = getClass("mapred.output.key.comparator.class",
            null, RawComparator.class);
    if (theClass != null)
        return ReflectionUtils.newInstance(theClass, this);
    return WritableComparator.get(getMapOutputKeyClass().asSubclass(WritableComparable.class));
}
```
That is how sorting and the comparison method are tied together. Now we can follow LongWritable's approach and implement our own custom type with read, write, and compare support. Let's write some code to reinforce this. Since we are sorting, we need some data: the two columns below are to be sorted by the first column in ascending order and, within equal first values, by the second column in ascending order:
```
1 2
1 1
3 0
3 2
2 2
1 2
```
First, define the custom type SortAPI:
```java
public class SortAPI implements WritableComparable<SortAPI> {
    /**
     * First column.
     */
    public Long first;
    /**
     * Second column.
     */
    public Long second;

    public SortAPI() {}

    public SortAPI(long first, long second) {
        this.first = first;
        this.second = second;
    }

    /**
     * The ordering lives here: this.first - o.first > 0 gives ascending
     * order; negate it for descending. (Casting the difference to int can
     * overflow for very large values; Long.compare is the safer idiom.)
     */
    @Override
    public int compareTo(SortAPI o) {
        long mis = (this.first - o.first);
        if (mis != 0) {
            return (int) mis;
        } else {
            return (int) (this.second - o.second);
        }
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(first);
        out.writeLong(second);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.first = in.readLong();
        this.second = in.readLong();
    }

    @Override
    public int hashCode() {
        return this.first.hashCode() + this.second.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof SortAPI) {
            SortAPI o = (SortAPI) obj;
            // compare values, not boxed Long references
            return this.first.equals(o.first) && this.second.equals(o.second);
        }
        return false;
    }

    @Override
    public String toString() {
        return "first:" + this.first + " second:" + this.second;
    }
}
```
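To see what this compareTo() does to the sample data, here is a plain-Java sketch of the same comparison rule applied outside Hadoop (a hypothetical demo class, not part of the job):

```java
import java.util.Arrays;

public class SortAPIOrderDemo {
    // A plain-Java copy of SortAPI.compareTo(): first column ascending,
    // then second column ascending.
    static int compare(long[] a, long[] b) {
        long mis = a[0] - b[0];
        if (mis != 0) return (int) mis;
        return (int) (a[1] - b[1]);
    }

    public static void main(String[] args) {
        // The six input rows from above.
        long[][] rows = { {1, 2}, {1, 1}, {3, 0}, {3, 2}, {2, 2}, {1, 2} };
        Arrays.sort(rows, SortAPIOrderDemo::compare);
        for (long[] r : rows) System.out.println(r[0] + " " + r[1]);
        // → 1 1, 1 2, 1 2, 2 2, 3 0, 3 2
    }
}
```

Note that the duplicate row `1 2` survives here, because plain sorting does not merge equal elements; in the MapReduce job below, equal keys are additionally grouped, so the duplicate collapses in the final output.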
This type overrides compareTo(SortAPI o), write(DataOutput out), and readFields(DataInput in). Since it supports comparison, remember (as discussed before) that hashCode() and equals(Object obj) must be overridden too — don't forget this! Also note that write and readFields must agree on field order: whatever field is written first must be read first. Finally, compareTo(SortAPI o) returns an int that is positive, zero, or negative to mean greater than, equal to, or less than; take a moment to trace how two rows are judged equal or unequal and how they are compared.
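The field-order point can be checked with a quick round trip through plain DataOutputStream/DataInputStream, mirroring SortAPI's two writeLong calls (a standalone sketch, not Hadoop code):

```java
import java.io.*;

public class WriteReadOrderDemo {
    public static void main(String[] args) throws IOException {
        // Write the two fields in the same order SortAPI.write() does.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeLong(3L);   // first
        out.writeLong(7L);   // second

        // readFields() must read them back in the same order,
        // otherwise first and second would be silently swapped.
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
        long first = in.readLong();
        long second = in.readLong();
        System.out.println("first:" + first + " second:" + second);
        // → first:3 second:7
    }
}
```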
Next come the custom Mapper and Reducer classes and the main method:
```java
public class MyMapper extends Mapper<LongWritable, Text, SortAPI, LongWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] splied = value.toString().split(" ");
        try {
            long first = Long.parseLong(splied[0]);
            long second = Long.parseLong(splied[1]);
            context.write(new SortAPI(first, second), new LongWritable(1));
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
    }
}
```
```java
public class MyReduce extends Reducer<SortAPI, LongWritable, LongWritable, LongWritable> {

    @Override
    protected void reduce(SortAPI key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        context.write(new LongWritable(key.first), new LongWritable(key.second));
    }
}
```
```java
static final String OUTPUT_DIR = "hdfs://hadoop-master:9000/sort/output/";
static final String INPUT_DIR = "hdfs://hadoop-master:9000/sort/input/test.txt";

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, Test.class.getSimpleName());
    deleteOutputFile(OUTPUT_DIR);

    // 1. set the input path
    FileInputFormat.setInputPaths(job, INPUT_DIR);
    // 2. set the input format class
    job.setInputFormatClass(TextInputFormat.class);
    // 3. set the custom Mapper and its key/value types
    job.setMapperClass(MyMapper.class);
    job.setMapOutputKeyClass(SortAPI.class);
    job.setMapOutputValueClass(LongWritable.class);
    // 4. partitioning
    job.setPartitionerClass(HashPartitioner.class);
    job.setNumReduceTasks(1);

    // 5. sorting and grouping
    // 6. set the custom Reducer and its key/value types
    job.setReducerClass(MyReduce.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(LongWritable.class);
    // 7. set the output path
    FileOutputFormat.setOutputPath(job, new Path(OUTPUT_DIR));
    // 8. submit the job
    job.waitForCompletion(true);
}

static void deleteOutputFile(String path) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI(INPUT_DIR), conf);
    if (fs.exists(new Path(path))) {
        fs.delete(new Path(path), true); // recursive delete of the old output dir
    }
}
```
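As an aside on step 4: with a single reduce task, HashPartitioner sends every key to partition 0, which is what makes one totally ordered output file possible. Its rule — roughly `(hashCode & Integer.MAX_VALUE) % numReduceTasks` — can be checked in isolation (hypothetical demo class, not Hadoop code):

```java
public class HashPartitionDemo {
    // HashPartitioner's rule: mask off the sign bit, then take the
    // remainder modulo the number of reduce tasks.
    static int partitionFor(int hashCode, int numReduceTasks) {
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With one reduce task, every key (even with a negative hash)
        // lands in partition 0.
        System.out.println(partitionFor(12345, 1)); // → 0
        System.out.println(partitionFor(-999, 1));  // → 0
    }
}
```

This is also why a correct hashCode() on SortAPI matters: it decides which reducer each key goes to once numReduceTasks is greater than 1.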
Now it can be run directly from Eclipse; the result:
```
1 1
1 2
2 2
3 0
3 2
```
The result is correct. What if we instead want the first column descending and the second column ascending? Just modify compareTo(SortAPI o):
```java
@Override
public int compareTo(SortAPI o) {
    long mis = (this.first - o.first) * -1;
    if (mis != 0) {
        return (int) mis;
    } else {
        return (int) (this.second - o.second);
    }
}
```
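The effect of that negation can be checked outside Hadoop. This sketch applies the same reversed rule (first column descending, second ascending) to the five distinct rows, written with the overflow-safe Long.compare instead of subtraction:

```java
import java.util.Arrays;

public class ReversedSortDemo {
    public static void main(String[] args) {
        long[][] rows = { {1, 1}, {1, 2}, {2, 2}, {3, 0}, {3, 2} };
        Arrays.sort(rows, (a, b) -> {
            int byFirst = Long.compare(b[0], a[0]); // descending first column
            if (byFirst != 0) return byFirst;
            return Long.compare(a[1], b[1]);        // ascending second column
        });
        for (long[] r : rows) System.out.println(r[0] + " " + r[1]);
        // → 3 0, 3 2, 2 2, 1 1, 1 2
    }
}
```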
Save and run again; the result:
```
3 0
3 2
2 2
1 1
1 2
```
Also correct, matching the requirement. Note that the duplicate input row `1 2` appears only once in the output, because identical keys are grouped together before reduce.
A small question to leave you with: when is this compareTo(SortAPI o) method called, and how many times in total?
That's it for this time. Keep recording the little things!
Hadoop MapReduce custom sorting with WritableComparable