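1 How the algorithm works
See the blog post:
Data-Intensive Text Processing with MapReduce, Chapter 3 (4): SECONDARY SORTING
2 Implementation
2.1 Preparing the data
The post uses sensor data as its example. For this experiment I wrote a Python script that generates such data automatically: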
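import random

# Generate 10,000 random records in the form sensor_id,timestamp:value
lines = []
for i in range(10000):
    sensor_id = "m" + str(random.randint(1, 200))
    timestamp = str(random.randint(1, 366))  # day of the year
    value = str(random.randint(-100, 100))
    lines.append(sensor_id + "," + timestamp + ":" + value + "\n")

# The with block closes the file, so all buffered data is flushed to disk
with open("sensor_records.txt", "w") as fs:
    fs.writelines(lines)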
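The generated data looks like this.
In each line, the first field is the sensor ID, the second is the timestamp (here, the day of the year), and the last is the sensor reading.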
m91,235:-95
m174,355:-77
m87,227:-66
m99,221:91
m172,69:92
m45,55:66
m104,194:87
m57,6:26
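2.2 Writing the program
2.2.1 Implementing the composite key
The composite key follows the same idea as the TextPair from the previous section: write a custom WritableComparable. One note: my compareTo compares each field in character (lexicographic) order, so in the results below the timestamps may not look sorted from smallest to largest. They are sorted, just as strings rather than as numbers. If you want integer ordering, convert the fields to integers inside compareTo before comparing them.
The code is as follows: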
package mp.secondsorting;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

/**
 * A custom composite key made of a sensor ID and a timestamp.
 * @author liupengeh
 */
public class CompositeWritable implements WritableComparable<CompositeWritable> {
    private Text sensor_id;
    private Text timestamp;

    // This no-argument constructor is required; without it the job fails at run time with:
    // Error: java.io.IOException: Unable to initialize any output collector
    public CompositeWritable() {
        set(new Text(), new Text());
    }

    public CompositeWritable(String p1, String p2) {
        set(new Text(p1), new Text(p2));
    }

    public CompositeWritable(Text p1, Text p2) {
        set(p1, p2);
    }

    public Text getSensor_id() {
        return sensor_id;
    }

    public void setSensor_id(Text sensor_id) {
        this.sensor_id = sensor_id;
    }

    public Text getTimestamp() {
        return timestamp;
    }

    public void setTimestamp(Text timestamp) {
        this.timestamp = timestamp;
    }

    public void set(Text p1, Text p2) {
        this.setSensor_id(p1);
        this.setTimestamp(p2);
    }

    // Deserialize the two fields in the same order in which write() emits them.
    @Override
    public void readFields(DataInput arg0) throws IOException {
        this.sensor_id.readFields(arg0);
        this.timestamp.readFields(arg0);
    }

    // Serialize the sensor ID first, then the timestamp.
    @Override
    public void write(DataOutput arg0) throws IOException {
        this.sensor_id.write(arg0);
        this.timestamp.write(arg0);
    }

    // Order by sensor ID first, then by timestamp; both are compared as text.
    @Override
    public int compareTo(CompositeWritable o) {
        int cmp = this.sensor_id.compareTo(o.sensor_id);
        if (cmp != 0) {
            return cmp;
        }
        return this.timestamp.compareTo(o.timestamp);
    }

    @Override
    public int hashCode() {
        return this.sensor_id.hashCode() * 163 + this.timestamp.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (o instanceof CompositeWritable) {
            CompositeWritable cw = (CompositeWritable) o;
            return this.sensor_id.equals(cw.sensor_id) && this.timestamp.equals(cw.timestamp);
        }
        return false;
    }

    @Override
    public String toString() {
        return this.sensor_id.toString() + "," + this.timestamp.toString();
    }
}
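The note above about character order versus integer order can be checked without Hadoop. The following is a minimal sketch that uses plain String values in place of Text (an assumption made so that it runs standalone; for ASCII digits the two orderings agree). The class and method names are hypothetical and not part of the original project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TimestampOrderDemo {
    // Sorts timestamps as strings, i.e. character by character.
    static List<String> sortAsStrings(List<String> timestamps) {
        List<String> sorted = new ArrayList<>(timestamps);
        sorted.sort((String a, String b) -> a.compareTo(b));
        return sorted;
    }

    // Sorts timestamps by their integer value.
    static List<String> sortAsIntegers(List<String> timestamps) {
        List<String> sorted = new ArrayList<>(timestamps);
        sorted.sort((String a, String b) ->
                Integer.compare(Integer.parseInt(a), Integer.parseInt(b)));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> timestamps = Arrays.asList("336", "64", "7", "110");
        System.out.println(sortAsStrings(timestamps));   // [110, 336, 64, 7]
        System.out.println(sortAsIntegers(timestamps));  // [7, 64, 110, 336]
    }
}
```

In CompositeWritable the same idea would mean parsing timestamp.toString() with Integer.parseInt inside compareTo, at the cost of a parse on every comparison.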
2.2.2 Implementing the Mapper
The Mapper is simple: for each line, extract the sensor_id, timestamp, and value, then emit them. The code is as follows:
public static class SecondSortingMapper extends Mapper<LongWritable, Text, CompositeWritable, Text> {
    /**
     * Mapper for the secondary sort.
     */
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Split the line to extract sensor_id, timestamp, and the reading. How to split
        // depends on the text format; here it is sensor_id,timestamp:value
        String[] lineSegments = line.split(":");
        String realValue = lineSegments[1];
        String[] keyParts = lineSegments[0].split(",");
        String sensorId = keyParts[0];
        String timestamp = keyParts[1];
        CompositeWritable cw = new CompositeWritable(sensorId, timestamp);
        context.write(cw, new Text(realValue));
    }
}
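The mapper above assumes every input line is well formed; a blank or truncated line would raise an ArrayIndexOutOfBoundsException. As a rough sketch of a more defensive split, the hypothetical helper below (not part of the original code) parses the sensor_id,timestamp:value layout in plain Java and returns null for lines that do not match, so the caller can skip them.

```java
public class SensorRecordParser {
    // Splits "sensor_id,timestamp:value" into {sensor_id, timestamp, value}.
    // Returns null when the line does not have a non-empty part in each position.
    static String[] parse(String line) {
        int comma = line.indexOf(',');
        int colon = line.indexOf(':');
        if (comma <= 0 || colon <= comma + 1 || colon == line.length() - 1) {
            return null;
        }
        return new String[] {
            line.substring(0, comma),
            line.substring(comma + 1, colon),
            line.substring(colon + 1)
        };
    }

    public static void main(String[] args) {
        String[] fields = parse("m91,235:-95");
        System.out.println(fields[0] + " " + fields[1] + " " + fields[2]); // m91 235 -95
        System.out.println(parse("") == null);                             // true
    }
}
```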
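2.2.3 Implementing the custom partitioner
To send all values of the same sensor_id, whatever their timestamps, to the same reducer, we need a custom partitioner that picks the reducer using only the hashCode of the sensor_id in the composite key. The code is as follows: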
public static class SecondSortingPartitioner extends Partitioner<CompositeWritable, Text> {
    /**
     * Custom partitioner that partitions by sensor_id only.
     */
    @Override
    public int getPartition(CompositeWritable key, Text value, int numPartitions) {
        // hashCode() may be negative; clear the sign bit so the result stays in [0, numPartitions)
        return (key.getSensor_id().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
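A partitioner must return an index in the range [0, numPartitions), but Java's % operator keeps the sign of its left operand, so a negative hashCode yields a negative partition. Hadoop's own HashPartitioner clears the sign bit with & Integer.MAX_VALUE for this reason. The sketch below illustrates the difference with a plain int standing in for the hash code; the class and method names are hypothetical.

```java
public class PartitionIndexDemo {
    // Plain remainder: negative whenever the hash code is negative.
    static int naivePartition(int hash, int numPartitions) {
        return hash % numPartitions;
    }

    // Clears the sign bit first, so the result is always in [0, numPartitions).
    static int safePartition(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(naivePartition(-7, 4)); // -3
        System.out.println(safePartition(-7, 4));  // 1
    }
}
```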
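2.2.4 Implementing the custom GroupComparator
Although the custom partitioner guarantees that all timestamped values of the same sensor_id go to the same reducer, the output file shows that, while the results are sorted by timestamp, each one sits on its own line. If we want one line of results per sensor_id, still ordered by timestamp, we need to add a GroupComparator. The code is as follows: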
public static class SensorTimestampGroupingComparator extends WritableComparator {
    /**
     * The custom partitioner ensures that data with the same sensor_id but different
     * timestamps reaches the same reducer, but we still need the reducer to form its
     * groups by the real "key", the sensor_id.
     * Without this comparator, the output looks like this:
     * m99,336 -93,
     * m99,348 52,
     * m99,354 70,
     * m99,357 97,
     * m99,64 -22,
     * m99,67 41,
     * m99,74 -62,
     * With this comparator, the results appear on one line:
     * m99,* -93,52,70,97,-22,41,-62,
     */
    public SensorTimestampGroupingComparator() {
        super(CompositeWritable.class, true);
    }

    // Compare only the sensor ID, so keys that differ only in timestamp fall into one group.
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
        CompositeWritable c1 = (CompositeWritable) w1;
        CompositeWritable c2 = (CompositeWritable) w2;
        return c1.getSensor_id().compareTo(c2.getSensor_id());
    }
}
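To make the effect of the grouping comparator concrete, here is a small simulation in plain Java, offered as a sketch and not as Hadoop's actual shuffle code. It sorts records by the full composite key (as strings, like compareTo above) and then groups them either by the full key or by sensor ID only. The class name, method name, and flag are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupingDemo {
    // Each record is {sensor_id, timestamp, value}.
    static Map<String, String> sortAndGroup(List<String[]> records, boolean groupBySensorOnly) {
        List<String[]> sorted = new ArrayList<>(records);
        // Sort phase: order by sensor ID, then by timestamp, both compared as strings.
        sorted.sort((a, b) -> {
            int cmp = a[0].compareTo(b[0]);
            return cmp != 0 ? cmp : a[1].compareTo(b[1]);
        });
        // Group phase: the list is sorted, so records with equal grouping keys are adjacent;
        // appending per key in iteration order mimics one reduce call per group.
        Map<String, String> output = new LinkedHashMap<>();
        for (String[] r : sorted) {
            String groupKey = groupBySensorOnly ? r[0] + ",*" : r[0] + "," + r[1];
            output.merge(groupKey, r[2] + ",", String::concat);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"m99", "64", "-22"});
        records.add(new String[] {"m99", "336", "-93"});
        records.add(new String[] {"m98", "44", "24"});
        records.add(new String[] {"m99", "348", "52"});
        System.out.println(sortAndGroup(records, false)); // {m98,44=24,, m99,336=-93,, m99,348=52,, m99,64=-22,}
        System.out.println(sortAndGroup(records, true));  // {m98,*=24,, m99,*=-93,52,-22,}
    }
}
```

Sorting always uses the full key, which is why the values inside each grouped line still come out in timestamp order.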
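2.2.5 Implementing the reducer
With everything above in place, the reducer is much simpler, but there is one point to note. If you want to keep the timestamps, the results must be written one per line and the GroupComparator above is not needed; if you want each sensor's results on a single line, you need the GroupComparator. The code is as follows: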
public static class SecondSortingReducer extends Reducer<CompositeWritable, Text, CompositeWritable, Text> {
    /**
     * Reducer for the secondary sort.
     */
    @Override
    public void reduce(CompositeWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Concatenate all values of the group, each followed by a comma
        StringBuilder res = new StringBuilder();
        for (Text t : values) {
            res.append(t.toString()).append(",");
        }
        // To keep the timestamps and write the results on separate lines:
        // context.write(key, new Text(res.toString()));
        // To write the results on a single line, use the SensorTimestampGroupingComparator
        // and replace the timestamp with the wildcard "*"
        context.write(new CompositeWritable(key.getSensor_id(), new Text("*")), new Text(res.toString()));
    }
}
2.3 Verifying the results
2.3.1 Viewing the raw data
cat datas/second_sorting/sensor_records.txt
As shown below, neither the sensor IDs nor the timestamps are in any particular order.
...
m200,257:-19
m166,147:77
m28,266:-9
m53,91:-69
m179,296:76
m168,209:-92
m62,358:78
m135,284:30
m32,235:-79
...
2.3.2 Results without the GroupComparator
hadoop fs -cat /lph/second_sorting/result/part-r-00000
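The sensor IDs are sorted and so are the timestamps (in character order), but each sensor ID and timestamp pair takes its own line. Records that share both fields, such as m99,1, are combined on one line.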
...
m98,44 24,
m98,47 26,
m98,48 44,
m98,51 2,
m98,60 79,
m98,65 -16,
m98,69 -71,
m98,79 40,
m98,90 -95,
m99,1 -36,35,
m99,106 86,
m99,110 -68,
m99,118 -91,
m99,133 5,
m99,139 92,
m99,140 3,-76,
m99,144 -54,
m99,153 -87,
...
2.3.3 Results with the GroupComparator
hadoop fs -cat /lph/second_sorting/result3/part-r-00000 | grep '^m9'
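The output (excerpt below) shows that the sensor IDs are sorted and that each sensor's values appear on a single line, in timestamp order.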
m98,* 62,-58,28,-62,53,-87,45,43,-79,-89,-37,-68,89,86,31,-56,-79,99,22,84,-21,6,-37,16,95,-78,20,-80,-60,10,2,37,75,-86,6,63,2,-81,15,66,-50,2,43,66,-49,24,26,44,2,79,-16,-71,40,-95,
m99,* -36,35,86,-68,-91,5,92,3,-76,-54,-87,70,-94,72,-97,-89,95,31,82,40,98,-18,16,-77,73,-74,44,91,-21,-18,-16,-58,-25,54,30,53,-70,-66,-93,52,70,97,-22,41,-62,-35,
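3 Summary
Secondary sorting by value (here the timestamp is treated as the value; you could also sort by the actual reading) is not always required when writing MapReduce programs, but it is a useful technique to know. Thanks for reading.
Source code on GitHub:
MapReduce Algorithm Design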