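I. Problem: an inverted index over the Shakespeare documents
II. A simple implementation
1) Map class. This is also where the output format of the map stage is defined.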
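// Imports used by the classes in this post
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public static class InvertedMapper extends Mapper<LongWritable,Text,Text,Text>{
    // With the default TextInputFormat the input key is a LongWritable (the byte
    // offset of the line), so the key type has to be LongWritable, not Long.
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Text one = new Text("1");
        // Name of the file this input split belongs to
        FileSplit fs = (FileSplit) context.getInputSplit();
        String filename = fs.getPath().getName();
        Text word = new Text();
        StringTokenizer token = new StringTokenizer(value.toString());
        while(token.hasMoreTokens()){
            word.set(token.nextToken()+":"+filename);
            context.write(word, one); // emits <word:file, 1>
        }
    }
}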
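2) Combiner class
I was rather confused here: I expected the combiner to have its own interface, so why does it extend Reducer? In the new API (org.apache.hadoop.mapreduce) a combiner is simply a Reducer subclass that is registered with job.setCombinerClass.
// Combine stage: still extends Reducer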
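public static class InvertedCombiner extends Reducer<Text,Text,Text,Text>{
    // Caveat: Hadoop may run a combiner zero, one or several times, so a job
    // that relies on the combiner to rewrite the key is fragile.
    @Override
    protected void reduce(Text key, Iterable<Text> values,Context context)
            throws IOException, InterruptedException {
        String k = key.toString();
        // Split on the last ':' so tokens that themselves contain ':' stay intact
        int idx = k.lastIndexOf(':');
        if(idx < 0){
            // Key already rewritten by an earlier pass: forward values unchanged
            for(Text val:values){
                context.write(key, val);
            }
            return;
        }
        int sum = 0;
        for(Text val:values){
            sum+=Integer.parseInt(val.toString());
        }
        // Output becomes <word, filename:sum>
        context.write(new Text(k.substring(0, idx)), new Text(k.substring(idx+1)+":"+sum));
    }
}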
3) Partitioner class
// Custom partitioner: makes sure that identical terms are sent to the same reducer
public static class InvertedPartioner extends HashPartitioner<Text, Text>{
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        // The map output key is <word:file>; hash only the word part
        String term = key.toString().split(":")[0];
        return super.getPartition(new Text(term), value, numReduceTasks);
    }
}
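To see why hashing only the term is enough, here is a minimal sketch in plain Java with no Hadoop dependency. It mirrors the formula that HashPartitioner applies, (hashCode & Integer.MAX_VALUE) % numReduceTasks, but it uses String.hashCode() as a stand-in for Text.hashCode(), so the partition numbers will not match a real job. The class name TermPartitionSketch and the file names are made up for illustration.

```java
class TermPartitionSketch {

    // Same shape as HashPartitioner.getPartition, applied to the term only.
    // Assumption: String.hashCode() stands in for Text.hashCode().
    public static int partitionFor(String compositeKey, int numReduceTasks) {
        String term = compositeKey.split(":")[0];
        return (term.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // Both keys share the term "love", so both land in the same partition.
        System.out.println(partitionFor("love:hamlet.txt", reducers));
        System.out.println(partitionFor("love:othello.txt", reducers));
    }
}
```

Masking with Integer.MAX_VALUE clears the sign bit, so the result of the modulo is never negative even when the hash code is.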
4) Reduce class
public static class InvertedReducer extends Reducer<Text,Text,Text,Text>{
    @Override
    protected void reduce(Text key, Iterable<Text> values,Context context)
            throws IOException, InterruptedException {
        // Join all <file:count> entries for this word, separated by ';'
        Iterator<Text> it = values.iterator();
        StringBuilder sb = new StringBuilder();
        if(it.hasNext())sb.append(it.next().toString());
        while(it.hasNext()){
            sb.append(";");
            sb.append(it.next().toString());
        }
        // Output is <word, file1:count1;file2:count2;...>
        context.write(key, new Text(sb.toString()));
    }
}
III. Takeaway: I feel I need to get the format of the map output and of the reduce input clear in my head.
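One way to get those formats straight is to trace them without a cluster. The following is a rough sketch in plain Java that imitates the three stages on in-memory lists; it is not Hadoop code. The class name InvertedIndexFlowSketch, the file names and the sample lines are all invented for illustration, and the combine step is run once over all map output, whereas Hadoop runs it per map task and does not guarantee how often.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class InvertedIndexFlowSketch {

    // Map stage: emit one <word:file, "1"> pair per token.
    public static List<String[]> map(String filename, String line) {
        List<String[]> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            out.add(new String[] {tokens.nextToken() + ":" + filename, "1"});
        }
        return out;
    }

    // Combine stage: sum the counts per <word:file>, then re-key as <word, file:sum>.
    public static List<String[]> combine(List<String[]> mapOutput) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String[] kv : mapOutput) {
            sums.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            int idx = e.getKey().lastIndexOf(':');
            out.add(new String[] {e.getKey().substring(0, idx),
                                  e.getKey().substring(idx + 1) + ":" + e.getValue()});
        }
        return out;
    }

    // Reduce stage: group by word and join the <file:sum> entries with ';'.
    public static Map<String, String> reduce(List<String[]> combined) {
        Map<String, String> index = new TreeMap<>();
        for (String[] kv : combined) {
            index.merge(kv[0], kv[1], (a, b) -> a + ";" + b);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String[]> mapped = new ArrayList<>();
        mapped.addAll(map("hamlet.txt", "to be or not to be"));
        mapped.addAll(map("othello.txt", "to be wise"));
        for (Map.Entry<String, String> e : reduce(combine(mapped)).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

With this sample input the entry for "be" comes out as hamlet.txt:2;othello.txt:1, which is the <word, file:count;file:count> shape the reducer above writes.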