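Inverted index: The inverted index is the most commonly used data structure in document retrieval systems and is widely used in full-text search engines. It stores a mapping from a word (or phrase) to its locations in a document or a set of documents, which provides a way to look up documents by their content. Because it maps content to documents, rather than a document to the content it contains, it is called an inverted index.
For example:
Input: three files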
news1 :
Hello, World! Hello, Urey!
news2 :
Hello, MapReduce!
news3 :
Hello, "Hadoop"!
Output:
Hadoop news3:1,
Hello news3:1,news1:2,news2:1,
MapReduce news2:1,
Urey news1:1,
World news1:1,
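Before moving to MapReduce, the mapping in the example can be sketched as a small in-memory program. This is only an illustration under assumptions, not the author's implementation: the class name `InvertedIndexDemo` and the method `build` are hypothetical, words are taken to be runs of letters or digits anywhere in the text, and postings are sorted by document name, so their order differs from the sample output.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InvertedIndexDemo {
    // Assumption: a word is any run of ASCII letters or digits.
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    // Builds word -> (document name -> occurrence count).
    public static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            Matcher m = WORD.matcher(doc.getValue());
            while (m.find()) {
                // Create the posting map on first sight of the word, then count.
                index.computeIfAbsent(m.group(), w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("news1", "Hello, World! Hello, Urey!");
        docs.put("news2", "Hello, MapReduce!");
        docs.put("news3", "Hello, \"Hadoop\"!");
        build(docs).forEach((word, postings) -> System.out.println(word + " " + postings));
    }
}
```

For the three sample files, the entry for `Hello` prints as `Hello {news1=2, news2=1, news3=1}`, which carries the same counts as the sample output in a different notation.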
MapReduce implementation:
Mapper Input:
<LongWritable,Text,Text,Text> (input key type, input value type, output key type, output value type)
Mapper Output:
word:uri 1
Hello:news1 1
Combiner Input:
word:uri 1
Hello:news1 1
Combiner Output:
word uri:number
Hello news1:number
Reducer Input:
word uri:number
Hello news1:number
Reducer Output:
word uri1:number1,uri2:number2,...
Hello news1:number1,news2:number2,...
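The key and value shapes listed above can be traced with a plain-Java simulation that needs no Hadoop installation. It is a simplified sketch: the class `PipelineSketch` and its methods are hypothetical names, all map output is combined in one pass (a real job combines per map task), and the URI is assumed to be a bare file name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PipelineSketch {
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    // Map stage: emit ("word:uri", "1") for every word in one line of a file.
    public static List<String[]> map(String uri, String line) {
        List<String[]> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            Matcher m = WORD.matcher(tokens.nextToken());
            if (m.find()) {
                out.add(new String[] {m.group() + ":" + uri, "1"});
            }
        }
        return out;
    }

    // Combine stage: sum the counts per "word:uri" key, then re-key to (word, "uri:count").
    public static List<String[]> combine(List<String[]> mapped) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String[] kv : mapped) {
            sums.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            int sep = e.getKey().indexOf(':');
            String word = e.getKey().substring(0, sep);
            String uri = e.getKey().substring(sep + 1);
            out.add(new String[] {word, uri + ":" + e.getValue()});
        }
        return out;
    }

    // Reduce stage: join all "uri:count" postings of a word with commas
    // (no trailing comma in this sketch).
    public static Map<String, String> reduce(List<String[]> combined) {
        Map<String, String> out = new TreeMap<>();
        for (String[] kv : combined) {
            out.merge(kv[0], kv[1], (a, b) -> a + "," + b);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> mapped = new ArrayList<>();
        mapped.addAll(map("news1", "Hello, World! Hello, Urey!"));
        mapped.addAll(map("news2", "Hello, MapReduce!"));
        mapped.addAll(map("news3", "Hello, \"Hadoop\"!"));
        reduce(combine(mapped)).forEach((word, postings) -> System.out.println(word + " " + postings));
    }
}
```

The point of the sketch is the re-keying in `combine`: the file name moves from the key into the value, so that the reduce stage can group postings by word alone.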
Source code:
Mapper:
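import java.io.IOException;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    // Reused across calls to avoid allocating new Text objects per record.
    private final Text outKey = new Text();
    private final Text outValue = new Text("1");
    // A word is the first run of letters or digits in a whitespace-delimited token.
    private static final Pattern PATTERN = Pattern.compile("[A-Za-z0-9]+");

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        // The input split identifies the file this line came from.
        FileSplit split = (FileSplit) context.getInputSplit();
        while (tokens.hasMoreTokens()) {
            Matcher match = PATTERN.matcher(tokens.nextToken());
            if (!match.find()) {
                // Skip punctuation-only tokens; returning here would drop the rest of the line.
                continue;
            }
            // Emit "word:path" -> "1"; exceptions propagate to the framework.
            outKey.set(match.group() + ":" + split.getPath());
            context.write(outKey, outValue);
        }
    }
}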
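The tokenizing rule used by the mapper (split on whitespace, then keep the first alphanumeric run of each token) can be tried in isolation. The following is a minimal sketch with a hypothetical class `TokenDemo`; the input line is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenDemo {
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    // Returns the first alphanumeric run of every whitespace-delimited token.
    public static List<String> words(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            Matcher m = WORD.matcher(tokens.nextToken());
            if (m.find()) {
                out.add(m.group());
            }
            // Tokens without any letter or digit are skipped, not treated as end of line.
        }
        return out;
    }

    public static void main(String[] args) {
        // "--" has no word; "don't" contributes only its first run, "don".
        System.out.println(words("Hello, \"Hadoop\"! -- don't"));
    }
}
```

Two consequences are worth noting: a punctuation-only token such as `--` yields nothing, and a token such as `don't` yields only `don`, because `find()` is called once per token.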
Combiner:
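import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Note: Hadoop may run a combiner zero, one, or several times, and map output is
// partitioned on the "word:path" key before the combiner runs. Re-keying here
// therefore only yields the expected output when the combiner runs exactly once
// per map output and the job uses a single reducer.
public class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Sum the "1" values emitted for this "word:path" key.
        int sum = 0;
        for (Text value : values) {
            sum += Integer.parseInt(value.toString());
        }
        // keys[0] is the word; the last piece ends with the file name
        // (the path itself may contain ':', as in "hdfs://host:port/...").
        String[] keys = key.toString().split(":");
        String tail = keys[keys.length - 1];
        // Re-key to word -> "fileName:count".
        outKey.set(keys[0]);
        outValue.set(tail.substring(tail.lastIndexOf('/') + 1) + ":" + sum);
        context.write(outKey, outValue);
    }
}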
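The key parsing in the combiner is easy to misread, because the path part of the key can itself contain colons. The sketch below replays that parsing on a sample key. The class `KeyParseDemo`, the method `rekey`, and the host name `namenode:9000` are hypothetical and used only for illustration.

```java
public class KeyParseDemo {
    // Turns a mapper key "word:path" plus a count into {word, "fileName:count"}.
    public static String[] rekey(String mapperKey, int count) {
        String[] parts = mapperKey.split(":");
        // The word contains only letters and digits, so it is always the first piece.
        String word = parts[0];
        // The last piece holds the end of the path, which finishes with the file name.
        String tail = parts[parts.length - 1];
        String fileName = tail.substring(tail.lastIndexOf('/') + 1);
        return new String[] {word, fileName + ":" + count};
    }

    public static void main(String[] args) {
        String[] kv = rekey("Hello:hdfs://namenode:9000/input/news1", 2);
        System.out.println(kv[0] + " " + kv[1]);
    }
}
```

With the key above, `split(":")` yields four pieces and the last one is `9000/input/news1`, so taking everything after the final `/` recovers `news1`. When the path contains no `/` at all, `lastIndexOf` returns -1 and the whole last piece is used as the file name.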
Reducer:
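import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Concatenate the "fileName:count" postings for this word; each one is
        // followed by a comma, which matches the sample output.
        StringBuilder postings = new StringBuilder();
        for (Text val : values) {
            postings.append(val).append(',');
        }
        context.write(key, new Text(postings.toString()));
    }
}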
For the basics of inverted indexes, see the article 《搜索引擎-倒排索引基础知识》 ("Search Engines: Inverted Index Basics").