A Simple Inverted Index Implementation with MapReduce

qwurey

Category: Hadoop. Tags: inverted index, MapReduce

Article link: https://blog.csdn.net/yeruby/article/details/40981561

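Inverted index: the inverted index is the most commonly used data structure in document retrieval systems and is widely used in full-text search engines. It stores a mapping from a word (or phrase) to its locations in a document or a set of documents, which makes it possible to find documents by their content. Because it maps content to documents, rather than mapping each document to the content it contains, it is called an inverted index.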

For example:

Input: three files

news1 :

Hello, World! Hello, Urey!

news2 :

Hello, MapReduce!

news3 :

Hello, "Hadoop"!

Output:

Hadoop	news3:1,
Hello	news3:1,news1:2,news2:1,
MapReduce	news2:1,
Urey	news1:1,
World	news1:1,
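Before moving to MapReduce, it may help to see the same mapping built in memory. The following is only a minimal sketch in plain Java: the class name `InMemoryInvertedIndex` and the method `build` are made up for illustration, and it assumes that a "word" is any run of ASCII letters or digits, matched case-sensitively.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the MapReduce job.
class InMemoryInvertedIndex {

    // Assumption: a word is a run of ASCII letters or digits.
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    // Maps each word to (document name -> number of occurrences in that document).
    static Map<String, Map<String, Integer>> build(Map<String, String> documents) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            Matcher matcher = WORD.matcher(doc.getValue());
            while (matcher.find()) {
                // Create the postings map for a new word, then count this occurrence.
                index.computeIfAbsent(matcher.group(), w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("news1", "Hello, World! Hello, Urey!");
        documents.put("news2", "Hello, MapReduce!");
        documents.put("news3", "Hello, \"Hadoop\"!");
        // Print one line per word: the word, a tab, then its postings map.
        build(documents).forEach((word, postings) -> System.out.println(word + "\t" + postings));
    }
}
```

The nested map is the whole idea: the outer key is the content (a word) and the inner map lists where it occurs. The MapReduce version produces the same information, only spread over several stages.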

MapReduce implementation:

Mapper type parameters (input key, input value, output key, output value):

<LongWritable, Text, Text, Text>

Mapper Output:

word:uri 1
Hello:news1 1

Combiner Input:

word:uri 1
Hello:news1 1

Combiner Output:

word uri:number
Hello news1:number

Reducer Input:

word uri:number
Hello news1:number

Reducer Output:

word uri1:number1,uri2:number2,...
Hello news1:number1,news2:number2,...
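The record shapes listed above can be traced in plain Java without a cluster. This is a rough sketch, not Hadoop code: `StageWalkthrough` and its three methods are hypothetical names, each record is modelled as a two-element `String[]` of key and value, and the combine step is applied once per file, which is an assumption about how the job runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory walk-through of the three stages; no Hadoop involved.
class StageWalkthrough {

    // Map stage: one ("word:uri", "1") record per word occurrence.
    static List<String[]> map(String uri, String text) {
        List<String[]> out = new ArrayList<>();
        for (String word : text.split("[^A-Za-z0-9]+")) {
            if (!word.isEmpty()) {
                out.add(new String[] {word + ":" + uri, "1"});
            }
        }
        return out;
    }

    // Combine stage: sum the counts per "word:uri" key, then emit (word, "uri:count").
    static List<String[]> combine(List<String[]> mapped) {
        Map<String, Integer> sums = new TreeMap<>();
        for (String[] record : mapped) {
            sums.merge(record[0], Integer.parseInt(record[1]), Integer::sum);
        }
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            String[] parts = e.getKey().split(":", 2);
            out.add(new String[] {parts[0], parts[1] + ":" + e.getValue()});
        }
        return out;
    }

    // Reduce stage: concatenate all "uri:count" values of a word, each followed by a comma.
    static Map<String, String> reduce(List<String[]> combined) {
        Map<String, String> out = new TreeMap<>();
        for (String[] record : combined) {
            out.merge(record[0], record[1] + ",", String::concat);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> combined = new ArrayList<>();
        combined.addAll(combine(map("news1", "Hello, World! Hello, Urey!")));
        combined.addAll(combine(map("news2", "Hello, MapReduce!")));
        combined.addAll(combine(map("news3", "Hello, \"Hadoop\"!")));
        reduce(combined).forEach((word, postings) -> System.out.println(word + "\t" + postings));
    }
}
```

Note how the key changes shape between stages: it is `word:uri` while counting and plain `word` once the per-file counts are known. That key rewrite is what lets the final stage group by word.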

Source code:

Mapper:

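import java.io.IOException;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {

	private final Text outKey = new Text();
	private final Text outValue = new Text("1");
	private final Pattern pattern = Pattern.compile("[A-Za-z0-9]+");

	@Override
	public void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		// The input split identifies the file this line came from.
		FileSplit split = (FileSplit) context.getInputSplit();
		StringTokenizer tokens = new StringTokenizer(value.toString());
		while (tokens.hasMoreTokens()) {
			// Keep the first alphanumeric run of the token, dropping punctuation.
			Matcher match = pattern.matcher(tokens.nextToken());
			if (!match.find()) {
				// Skip tokens that contain no letters or digits and keep going.
				continue;
			}
			// Emit "word:path" -> "1"; write errors propagate to the framework.
			outKey.set(match.group() + ":" + split.getPath());
			context.write(outKey, outValue);
		}
	}
}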
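The mapper above splits each line on whitespace and then keeps only the first alphanumeric run of each token, so a token such as `key-value` contributes just `key`. If every run should be indexed, one alternative is to scan the whole line with the matcher. The sketch below shows that variant in plain Java; `WordExtractor` and `extractWords` are hypothetical names, and the pattern is the same `[A-Za-z0-9]+` used by the mapper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper showing an alternative tokenization; not the article's mapper.
class WordExtractor {

    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    // Returns every alphanumeric run in the line, in order of appearance.
    static List<String> extractWords(String line) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(line);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extractWords("Hello, \"Hadoop\"!"));
        System.out.println(extractWords("key-value pairs"));
    }
}
```

Inside `map`, the loop body would build the `word:path` key from `matcher.group()` in the same way as before. Whether hyphenated tokens should be split like this depends on what the index is for, so treat it as an option rather than a correction.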

Combiner:

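import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Note: Hadoop may run a combiner zero, one, or several times, and it partitions
// map output by the original "word:path" key. Because this combiner rewrites the
// key, the job as written depends on the combiner running exactly once for each
// record and on a single reduce task.
public class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {

	private final Text outKey = new Text();
	private final Text outValue = new Text();

	@Override
	public void reduce(Text key, Iterable<Text> values, Context context)
			throws IOException, InterruptedException {
		// Sum the occurrences of one word in one file.
		int sum = 0;
		for (Text value : values) {
			sum += Integer.parseInt(value.toString());
		}
		// The key is "word:path"; the path may itself contain ':' (for example
		// "hdfs://host:port/..."), so take the word from the first field and
		// the file name from the text after the last '/'.
		String[] keys = key.toString().split(":");
		outKey.set(keys[0]);
		String tail = keys[keys.length - 1];
		outValue.set(tail.substring(tail.lastIndexOf('/') + 1) + ":" + sum);
		// Emit word -> "file:count"; write errors propagate to the framework.
		context.write(outKey, outValue);
	}
}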
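One thing to watch with this combiner: Hadoop assigns each map output record to a partition based on its key before any combiner runs, and the default `HashPartitioner` hashes the whole key. Since the mapper's key is `word:uri`, the same word coming from different files can land on different reducers when the job uses more than one reduce task. A possible remedy is to partition on the word alone. The sketch below shows only that arithmetic in plain Java; `WordPartitionSketch` and `partitionFor` are hypothetical names, and the formula mirrors the one `HashPartitioner` applies to the full key.

```java
// Hypothetical sketch of partitioning by the word part of the key only.
class WordPartitionSketch {

    // Returns the partition index for a "word:uri" key (or a bare "word" key),
    // looking only at the text before the first ':'.
    static int partitionFor(String key, int numPartitions) {
        int colon = key.indexOf(':');
        String word = colon < 0 ? key : key.substring(0, colon);
        // Clear the sign bit so the result of % is never negative.
        return (word.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Both keys share the word "Hello", so they map to the same partition.
        System.out.println(partitionFor("Hello:hdfs://host:9000/in/news1", 4));
        System.out.println(partitionFor("Hello:hdfs://host:9000/in/news2", 4));
    }
}
```

In a real job this logic would sit in the `getPartition` method of a subclass of `org.apache.hadoop.mapreduce.Partitioner<Text, Text>`, registered with `job.setPartitionerClass(...)`. It addresses only the partitioning issue; the combiner still has to run exactly once for the value formats to line up.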

Reducer:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {

	@Override
	public void reduce(Text key, Iterable<Text> values, Context context)
			throws IOException, InterruptedException {
		// Concatenate every "file:count" value for this word, each followed by a comma.
		StringBuilder postings = new StringBuilder();
		for (Text val : values) {
			postings.append(val).append(',');
		}
		context.write(key, new Text(postings.toString()));
	}
}
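The reducer above appends a comma after every value, which is why each line of the sample output ends with a comma. If that trailing comma is unwanted, `java.util.StringJoiner` places the separator only between items. A small sketch in plain Java, where `PostingsJoiner` and `join` are hypothetical names and the postings are plain strings rather than Hadoop `Text` objects:

```java
import java.util.Arrays;
import java.util.StringJoiner;

// Hypothetical helper showing comma-separated joining without a trailing comma.
class PostingsJoiner {

    // Joins "file:count" entries with commas placed between items only.
    static String join(Iterable<String> postings) {
        StringJoiner joiner = new StringJoiner(",");
        for (String posting : postings) {
            joiner.add(posting);
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList("news3:1", "news1:2", "news2:1")));
    }
}
```

In the reducer, the loop would call `joiner.add(val.toString())` for each `Text` value and write `joiner.toString()` at the end.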

For background on inverted indexes, see the article 《搜索引擎-倒排索引基础知识》 (Search Engines: Inverted Index Basics).