A Simple Inverted Index Implementation with MapReduce

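Inverted index: an inverted index is the most commonly used data structure in document retrieval systems and is widely used in full-text search engines. It stores a mapping from a word (or phrase) to the places where it occurs in a document or a set of documents, which gives a way to look up documents by their content. Because it maps content to documents rather than documents to the content they contain, it is called an inverted index.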

For example:

Input: three files

news1 :

Hello, World! Hello, Urey!

news2 :

Hello, MapReduce!

news3 :

Hello, "Hadoop"!


Output:

Hadoop	news3:1,
Hello	news3:1,news1:2,news2:1,
MapReduce	news2:1,
Urey	news1:1,
World	news1:1,
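The mapping above can be sketched in memory, without Hadoop, to make the target data structure concrete. This is only an illustration under stated assumptions: the class name `InvertedIndexSketch` and the method `build` are hypothetical, and a word is taken to be any run of ASCII letters or digits.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the original job.
class InvertedIndexSketch {
    // Assumption: a word is a run of ASCII letters or digits.
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    // Maps each word to (document name -> number of occurrences in that document).
    static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            Matcher m = WORD.matcher(doc.getValue());
            while (m.find()) {
                index.computeIfAbsent(m.group(), w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("news1", "Hello, World! Hello, Urey!");
        docs.put("news2", "Hello, MapReduce!");
        docs.put("news3", "Hello, \"Hadoop\"!");
        // Print one line per word: the word, a tab, then its per-document counts.
        for (Map.Entry<String, Map<String, Integer>> e : build(docs).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Sorted maps are used here only so that the printed order is deterministic; the MapReduce job makes no such promise about the order of postings within a line.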

MapReduce implementation:

Mapper Input:

<LongWritable, Text>: the byte offset of a line and the text of that line (the first two type parameters of Mapper<LongWritable, Text, Text, Text>)

Mapper Output:

word:uri 1
Hello:news1 1


Combiner Input:

word:uri 1
Hello:news1 1

Combiner Output:

word uri:number
Hello news1:number

Reducer Input:

word uri:number
Hello news1:number

Reducer Output:

word uri1:number1,uri2:number2,...
Hello news1:number1,news2:number2,...
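The record formats listed above can be traced end to end with a small plain-Java simulation. This is a sketch, not the Hadoop job itself: the class `PipelineSketch` and its three methods are hypothetical, sorted maps stand in for the shuffle, and the combine step is assumed to run exactly once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical simulation of the three phases; no Hadoop classes are used.
class PipelineSketch {
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    // Map phase: one ("word:uri", 1) record per occurrence, grouped by key as the shuffle would do.
    static Map<String, List<Integer>> map(Map<String, String> docs) {
        Map<String, List<Integer>> out = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            Matcher m = WORD.matcher(doc.getValue());
            while (m.find()) {
                out.computeIfAbsent(m.group() + ":" + doc.getKey(), k -> new ArrayList<>()).add(1);
            }
        }
        return out;
    }

    // Combine phase: sum the ones, then re-key from "word:uri" to "word" with value "uri:count".
    static Map<String, List<String>> combine(Map<String, List<Integer>> mapped) {
        Map<String, List<String>> out = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : mapped.entrySet()) {
            int sum = 0;
            for (int one : e.getValue()) {
                sum += one;
            }
            String[] parts = e.getKey().split(":", 2);
            out.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1] + ":" + sum);
        }
        return out;
    }

    // Reduce phase: join all "uri:count" values of a word with commas.
    static Map<String, String> reduce(Map<String, List<String>> combined) {
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : combined.entrySet()) {
            out.put(e.getKey(), String.join(",", e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("news1", "Hello, World! Hello, Urey!");
        docs.put("news2", "Hello, MapReduce!");
        for (Map.Entry<String, String> e : reduce(combine(map(docs))).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Unlike the Reducer Output format above, this sketch joins the values without a trailing comma.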


Source code:

Mapper:

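import java.io.IOException;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {

	// Reused across calls to avoid allocating a new Text per record.
	private final Text outKey = new Text();
	private final Text outValue = new Text("1");
	// A word is the first run of letters or digits inside a whitespace-separated token.
	private final Pattern pattern = Pattern.compile("[A-Za-z0-9]+");

	@Override
	public void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		StringTokenizer tokens = new StringTokenizer(value.toString());
		// The input split identifies the file this line came from.
		FileSplit split = (FileSplit) context.getInputSplit();
		while (tokens.hasMoreTokens()) {
			Matcher match = pattern.matcher(tokens.nextToken());
			if (!match.find()) {
				// Skip tokens without letters or digits and keep processing the rest of the line.
				continue;
			}
			// Emit ("word:path", "1"); the Combiner later reduces the path to the file name.
			outKey.set(match.group() + ":" + split.getPath());
			context.write(outKey, outValue);
		}
	}
}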
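The Mapper's tokenization can be tried on its own, outside Hadoop. The sketch below mirrors the same two steps (whitespace split with `StringTokenizer`, then the first `[A-Za-z0-9]+` match per token); the class `TokenizeSketch` and the method `words` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone version of the Mapper's word extraction.
class TokenizeSketch {
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    // Returns the first alphanumeric run of every whitespace-separated token that has one.
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            Matcher m = WORD.matcher(tokens.nextToken());
            if (m.find()) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "--" contains no word and is skipped; "a,b" contributes only "a".
        System.out.println(words("Hello, \"Hadoop\"! -- a,b")); // prints [Hello, Hadoop, a]
    }
}
```

Because only the first match in each token is used, a token such as `a,b` is indexed as `a` alone. Looping over `find()` would pick up every run, at the cost of departing from the Mapper shown above.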

Combiner:

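// Static nested class of an enclosing class that is not shown here. Requires
// java.io.IOException, org.apache.hadoop.io.Text and org.apache.hadoop.mapreduce.Reducer.
public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {

	private final Text outKey = new Text();
	private final Text outValue = new Text();

	// Caveat: this Combiner changes the key format, so the job only produces correct output
	// if Hadoop runs it exactly once per map output, which the framework does not guarantee.
	@Override
	public void reduce(Text key, Iterable<Text> values, Context context)
			throws IOException, InterruptedException {
		// Sum the "1" values emitted by the Mapper for this "word:path" key.
		int sum = 0;
		for (Text value : values) {
			sum += Integer.parseInt(value.toString());
		}
		// The path may itself contain ':' (for example a URI scheme), so take the first
		// piece as the word and the last piece as the tail of the path.
		String[] keys = key.toString().split(":");
		outKey.set(keys[0]);
		String path = keys[keys.length - 1];
		// Keep only the file name after the last '/'.
		outValue.set(path.substring(path.lastIndexOf('/') + 1) + ":" + sum);
		context.write(outKey, outValue);
	}
}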
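The key handling in the Combiner is easier to follow on a concrete key. The sketch below applies the same `split(":")` and `lastIndexOf('/')` steps in plain Java; the class `CombinerKeySketch`, the method `rekey`, and the host `namenode:9000` are hypothetical and chosen only to show a path that contains extra colons.

```java
// Hypothetical stand-alone version of the Combiner's key rewriting.
class CombinerKeySketch {
    // Turns ("word:path", sum) into {"word", "fileName:sum"}.
    static String[] rekey(String mapKey, int sum) {
        String[] parts = mapKey.split(":");
        String word = parts[0];
        // The last piece always holds the end of the path, however many ':' the URI had.
        String tail = parts[parts.length - 1];
        // lastIndexOf returns -1 when there is no '/', so substring(0) keeps the whole tail.
        String fileName = tail.substring(tail.lastIndexOf('/') + 1);
        return new String[] { word, fileName + ":" + sum };
    }

    public static void main(String[] args) {
        String[] out = rekey("Hello:hdfs://namenode:9000/input/news1", 2);
        System.out.println(out[0] + " " + out[1]); // prints Hello news1:2
    }
}
```

This parsing assumes that neither the word nor the file name contains a colon; the word pattern guarantees the former, while the latter depends on how the input files are named.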

Reducer:

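import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {

	@Override
	public void reduce(Text key, Iterable<Text> values, Context context)
			throws IOException, InterruptedException {
		// Concatenate every "file:count" value for this word, each followed by a comma
		// (the trailing comma matches the sample output).
		StringBuilder postings = new StringBuilder();
		for (Text val : values) {
			postings.append(val).append(',');
		}
		context.write(key, new Text(postings.toString()));
	}
}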





For background on inverted indexes, see the article "Search Engines: Inverted Index Basics" (《搜索引擎-倒排索引基础知识》).

