Inverted index: look up documents by term.
Implementation (target output layout):
term1  doc1:count, doc2:count, doc5:count
term1  average count
term2  doc3:count, doc6:count
term2  average count
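Before the MapReduce version, the core idea can be sketched in plain Java. This is only an illustration of the term -> (docid -> count) structure; the class name `InvertedIndexSketch`, the `buildIndex` helper, and the sample documents are assumptions, not part of the job below.

```java
import java.util.*;

public class InvertedIndexSketch {
    // Build term -> (docId -> count) from in-memory documents.
    static Map<String, Map<String, Integer>> buildIndex(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String term : doc.getValue().split("\\s+")) {
                index.computeIfAbsent(term, t -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1", "hadoop mapreduce hadoop");
        docs.put("doc2", "mapreduce index");
        Map<String, Map<String, Integer>> index = buildIndex(docs);
        System.out.println(index.get("hadoop"));     // {doc1=2}
        System.out.println(index.get("mapreduce"));  // {doc1=1, doc2=1}
    }
}
```

The MapReduce job distributes exactly this computation: the Mapper tokenizes, the Combiner/Reducer sum counts per (term, docid).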
Mapper:
Output key:   term-->docid
       value: 1
public static class Mapper1 extends Mapper<LongWritable, Text, Text, LongWritable> {
    private Text outKey = new Text();
    private LongWritable outValue = new LongWritable();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Derive the document id from the input file name, stripping the extension.
        FileSplit inputSplit = (FileSplit) context.getInputSplit();
        String fileName = inputSplit.getPath().getName();
        int index = fileName.lastIndexOf(".");
        if (index > 0) {
            fileName = fileName.substring(0, index);
        }
        // Emit one (term-->docid, 1) pair per token on the line.
        for (String each : value.toString().split("\\s+")) {
            outKey.set(each + "-->" + fileName);
            outValue.set(1);
            context.write(outKey, outValue);
        }
    }
}
Combiner:
Performs local aggregation on each map task's output, reducing the volume of data shuffled across the network.
public static class Combiner1 extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the partial counts for this term-->docid key.
        long cnt = 0;
        for (LongWritable value : values) {
            cnt += value.get();
        }
        context.write(key, new LongWritable(cnt));
    }
}
Partitioner:
key: term-->docid
All records for the same term must land on the same reducer, so a custom partitioner that partitions on the term alone is needed.
Note: the key arriving at the reducer is still term-->docid, and sorting is also done on the full term-->docid key.
public static class Partitioner1 extends HashPartitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text key, LongWritable value, int numReduceTasks) {
        // Hash on the term alone so every docid for a term reaches the same reducer.
        String[] tokens = key.toString().split("-->");
        return super.getPartition(new Text(tokens[0]), value, numReduceTasks);
    }
}
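The notes stop before the reduce side. Per the layout at the top (per-document counts plus an average count per term), the reducer would regroup the sorted term-->docid keys by term, list each document's count, and divide the total by the number of documents. The grouping/averaging logic can be sketched without Hadoop; the class name `TermStatsSketch`, the `summarize` helper, and the one-line "term doc:cnt,... avg=x" output format are assumptions, not the original job's exact format.

```java
import java.util.*;

public class TermStatsSketch {
    // Input: "term-->docid" keys with summed counts, already sorted by key
    // (as they would arrive at the reducer). Output: one line per term with
    // its per-document counts and the average count per document.
    static List<String> summarize(LinkedHashMap<String, Long> keyedCounts) {
        // Regroup by term; insertion order keeps terms in sorted order.
        LinkedHashMap<String, LinkedHashMap<String, Long>> byTerm = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : keyedCounts.entrySet()) {
            String[] tokens = e.getKey().split("-->");
            byTerm.computeIfAbsent(tokens[0], t -> new LinkedHashMap<>())
                  .put(tokens[1], e.getValue());
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, LinkedHashMap<String, Long>> e : byTerm.entrySet()) {
            long total = 0;
            StringJoiner docs = new StringJoiner(",");
            for (Map.Entry<String, Long> d : e.getValue().entrySet()) {
                docs.add(d.getKey() + ":" + d.getValue());
                total += d.getValue();
            }
            double avg = (double) total / e.getValue().size();
            lines.add(e.getKey() + " " + docs.toString() + " avg=" + avg);
        }
        return lines;
    }

    public static void main(String[] args) {
        LinkedHashMap<String, Long> counts = new LinkedHashMap<>();
        counts.put("hadoop-->doc1", 2L);
        counts.put("hadoop-->doc2", 4L);
        counts.put("index-->doc2", 1L);
        for (String line : summarize(counts)) {
            System.out.println(line);
        }
    }
}
```

In the real job this logic lives in a `Reducer<Text, LongWritable, Text, Text>`: with the custom partitioner above, all term-->docid keys for one term reach the same reduce task, so the per-term state can be accumulated across consecutive `reduce()` calls.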