Inverted Index

jsperlee

Article link: https://blog.csdn.net/qq_27347421/article/details/104166621

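Inverted Index Concept

An inverted index (also called a reverse index, postings file, or inverted file) is an indexing method. Its main application is full-text search, as in search engines.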

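Forward index:

An index built per file: for each of several files, it records the keywords the file contains, along with each keyword's position and count.
1.txt hello,1,0 hello,1,17
2.txt hello,3,5
3.txt spark
4.txt hbase
A forward index answers which keywords a file contains and where they are, so going from a file name to its keywords is easy. Finding which files contain hello is expensive: you have to fetch the forward index of every file and loop through each one to check whether it contains the keyword.
A real search engine works the other way around: the input is a keyword, and the result is the set of files that contain it.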
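The lookup cost described above can be sketched in plain Java. This is a minimal illustration, not code from the original post: the class name `ForwardIndexSketch`, its methods, and the sample file contents are assumptions, and the sketch stores only the keyword list per file (positions and counts are left out for brevity).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a forward index maps a file name to the keywords in that file.
final class ForwardIndexSketch {

	// Builds file name -> list of whitespace-separated keywords.
	static Map<String, List<String>> build(Map<String, String> files) {
		Map<String, List<String>> index = new LinkedHashMap<>();
		for (Map.Entry<String, String> e : files.entrySet()) {
			List<String> words = new ArrayList<>();
			for (String w : e.getValue().split("\\s+")) {
				if (!w.isEmpty()) {
					words.add(w);
				}
			}
			index.put(e.getKey(), words);
		}
		return index;
	}

	// A keyword lookup has to visit the entry of every file.
	static List<String> filesContaining(Map<String, List<String>> index, String keyword) {
		List<String> hits = new ArrayList<>();
		for (Map.Entry<String, List<String>> e : index.entrySet()) {
			if (e.getValue().contains(keyword)) {
				hits.add(e.getKey());
			}
		}
		return hits;
	}

	public static void main(String[] args) {
		// Sample contents are made up for the demonstration.
		Map<String, String> files = new LinkedHashMap<>();
		files.put("1.txt", "hello hadoop hello");
		files.put("2.txt", "hello");
		files.put("3.txt", "spark");
		files.put("4.txt", "hbase");
		Map<String, List<String>> index = build(files);
		System.out.println(index.get("1.txt"));
		System.out.println(filesContaining(index, "hello"));
	}
}
```

Getting the keywords of `1.txt` is a single map lookup, while `filesContaining` scans all files, which is the cost an inverted index avoids.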

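Inverted index:

Indexed by keyword: for each keyword, it records the files in which the keyword appears.
hello 1.txt,3,1,4,6 2.txt,1,3
spark 3.txt,2,1,5 4.txt,1,2
Search engine example: entering 华为 (Huawei) returns many web pages that contain the keyword, for instance 华为 www.huawei.com,0 www.baidubaike.com,2
This way of indexing is called an inverted index, and it makes full-text search for a given keyword straightforward.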
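As a rough in-memory counterpart, the sketch below keeps a map from keyword to the files that contain it. The class `InvertedIndexSketch`, its method names, and the choice to store word positions (rather than the exact numbers shown in the listing above) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: keyword -> (file name -> word positions in that file).
final class InvertedIndexSketch {

	private final Map<String, Map<String, List<Integer>>> postings = new TreeMap<>();

	// Splits the content on whitespace and records the position of every word.
	void addFile(String fileName, String content) {
		String[] words = content.trim().split("\\s+");
		for (int pos = 0; pos < words.length; pos++) {
			if (words[pos].isEmpty()) {
				continue; // happens only for empty content
			}
			postings.computeIfAbsent(words[pos], k -> new TreeMap<>())
					.computeIfAbsent(fileName, k -> new ArrayList<>())
					.add(pos);
		}
	}

	// One map lookup returns every file containing the keyword.
	Map<String, List<Integer>> lookup(String keyword) {
		Map<String, List<Integer>> hit = postings.get(keyword);
		return hit != null ? hit : Collections.<String, List<Integer>>emptyMap();
	}

	public static void main(String[] args) {
		InvertedIndexSketch index = new InvertedIndexSketch();
		index.addFile("1.txt", "hello hadoop hello");
		index.addFile("2.txt", "hello spark");
		System.out.println(index.lookup("hello"));
		System.out.println(index.lookup("spark"));
	}
}
```

The number of occurrences in a file is simply the length of its position list, so positions and counts come from the same structure.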

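Example:

Build an inverted index that records, for every keyword, on which line of each document it appears and how many times,
that is, the position and count of each keyword in each file.
Grouping key: the keyword
map:
key: the keyword
value: file name, position (the byte offset of the line, a LongWritable), count
reduce side:
All file information for the same keyword arrives at the same reduce call, where it is concatenated.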
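Before looking at the Hadoop job, the plan above can be traced in plain Java without a cluster. This is only a simulation under stated assumptions: `MapReducePlanSketch` and its three methods are hypothetical names, the grouping that Hadoop's shuffle performs is imitated with a `TreeMap`, and the sample lines are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical simulation of the map / group / reduce plan, without Hadoop.
final class MapReducePlanSketch {

	// "map": one tab-separated line -> keyword -> "file:offset,count;"
	static Map<String, String> mapLine(String fileName, long offset, String line) {
		Map<String, Integer> counts = new LinkedHashMap<>();
		for (String word : line.split("\t")) {
			if (!word.isEmpty()) {
				counts.merge(word, 1, Integer::sum);
			}
		}
		Map<String, String> out = new LinkedHashMap<>();
		for (Map.Entry<String, Integer> e : counts.entrySet()) {
			out.put(e.getKey(), fileName + ":" + offset + "," + e.getValue() + ";");
		}
		return out;
	}

	// "shuffle": collect the values of identical keywords into one list.
	static void shuffle(Map<String, List<String>> grouped, Map<String, String> mapped) {
		for (Map.Entry<String, String> e : mapped.entrySet()) {
			grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
		}
	}

	// "reduce": concatenate the values and drop the trailing ';'.
	static String reduce(List<String> values) {
		StringBuilder sb = new StringBuilder();
		for (String v : values) {
			sb.append(v);
		}
		return sb.length() == 0 ? "" : sb.substring(0, sb.length() - 1);
	}

	public static void main(String[] args) {
		Map<String, List<String>> grouped = new TreeMap<>();
		shuffle(grouped, mapLine("1.txt", 0L, "hello\thadoop\thello"));
		shuffle(grouped, mapLine("2.txt", 0L, "hello\tspark"));
		for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
			System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
		}
	}
}
```

In a real job the order in which values reach a reduce call is not guaranteed, so the order of the concatenated entries can differ between runs.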

package com.lee.invertedindex;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class InvertedIndex {
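	/*
	 * map:
	 *   key: the keyword
	 *   value: file name, position (byte offset of the line, a LongWritable), count
	 * reduce side:
	 *   all file information for the same keyword arrives together and is concatenated
	 */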
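	static class myMapper extends Mapper<LongWritable, Text, Text, Text>{
		// Name of the file that this map task's input split belongs to
		String filename;
		// Output key and value objects, reused across map() calls
		Text mk=new Text();
		Text mv=new Text();
		@Override
		protected void setup(Mapper<LongWritable, Text, Text, Text>.Context context)
				throws IOException, InterruptedException {
			// Read the file name once per map task from the input split
			FileSplit split = (FileSplit)context.getInputSplit();
			filename= split.getPath().getName();
		}
		// Called once per input line: key is the line's byte offset, value is the line text
		@Override
		protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, Text>.Context context)
				throws IOException, InterruptedException {
			// Keywords on a line are expected to be tab-separated
			String[] keywords = value.toString().split("\t");
			// Count how often each keyword occurs on this line
			Map<String,Integer> map=new HashMap<String,Integer>();
			for (String k : keywords) {
				// Skip empty tokens produced by blank lines or consecutive tabs
				if(k.isEmpty()) {
					continue;
				}
				if(map.containsKey(k)) {
					map.put(k, map.get(k)+1);
				}
				else {
					map.put(k, 1);
				}
			}
			// Emit one record per distinct keyword on this line
			Set<String> keySet = map.keySet();
			for (String k : keySet) {
				mk.set(k);
				// Value format: fileName:offset,count;
				mv.set(filename+":"+key.get()+","+map.get(k)+";");
				context.write(mk, mv);
			}
		}
	}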
	static class myReduce extends Reducer<Text, Text, Text, Text>{
		Text rv=new Text();
		@Override
		protected void reduce(Text key, Iterable<Text> values, 
				Context context)
				throws IOException, InterruptedException {
			// Concatenate every "fileName:offset,count;" entry for this keyword
			StringBuilder sb = new StringBuilder();
			for (Text v : values) {
				 sb.append(v.toString());
			}
			// Drop the trailing ';' (each value ends with one, and there is at least one value)
			rv.set(sb.substring(0,sb.length()-1));
			context.write(key, rv);
		}
	}
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		Configuration conf = new Configuration();
		
		Job job=Job.getInstance(conf);
		job.setJarByClass(InvertedIndex.class);
		
		job.setMapperClass(myMapper.class);
		job.setReducerClass(myReduce.class);
		
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		
		FileInputFormat.addInputPath(job, new Path("F:\\Invertedindex"));
		
		// The output directory must not exist before the job runs.
		// It receives the result files (part-r-*) and the _SUCCESS marker file.
		FileOutputFormat.setOutputPath(job, new Path("F:\\Invertedindex_out"));
		
		boolean waitForCompletion = job.waitForCompletion(true);
		System.exit(waitForCompletion?0:1);
	}
}
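
With the default text output, each result line is the keyword, a tab, and the concatenated `fileName:offset,count` entries separated by `;`. The sketch below reads such a line back into objects. `IndexLineParser` and `Posting` are hypothetical names that are not part of the job above, and the parser assumes well-formed lines (a malformed entry makes it throw).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: parses one output line of the job,
// e.g. "hello<TAB>1.txt:0,2;2.txt:17,1".
final class IndexLineParser {

	// One "fileName:offset,count" entry.
	static final class Posting {
		final String file;
		final long offset;
		final int count;

		Posting(String file, long offset, int count) {
			this.file = file;
			this.offset = offset;
			this.count = count;
		}
	}

	static List<Posting> parse(String line) {
		List<Posting> postings = new ArrayList<>();
		// Split into keyword and value at the first tab only
		String[] kv = line.split("\t", 2);
		if (kv.length < 2) {
			return postings;
		}
		for (String entry : kv[1].split(";")) {
			if (entry.isEmpty()) {
				continue;
			}
			// Search from the right so that a ',' or ':' inside the file name is tolerated
			int comma = entry.lastIndexOf(',');
			int colon = entry.lastIndexOf(':', comma);
			String file = entry.substring(0, colon);
			long offset = Long.parseLong(entry.substring(colon + 1, comma));
			int count = Integer.parseInt(entry.substring(comma + 1));
			postings.add(new Posting(file, offset, count));
		}
		return postings;
	}

	public static void main(String[] args) {
		for (Posting p : parse("hello\t1.txt:0,2;2.txt:17,1")) {
			System.out.println(p.file + " offset=" + p.offset + " count=" + p.count);
		}
	}
}
```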
