MapReduce Relational Operations (Part 3)

Latest recommended article published 2021-06-17 17:18:20

xxiaoMinGLL

Article tags: MapReduce, inverted index

Article link: https://blog.csdn.net/xxiaoMinGLL/article/details/79237435

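Inverted Index

The "inverted index" is the most commonly used data structure in document retrieval systems and is widely used in full-text search engines. It stores a mapping from a word (or phrase) to the places where it occurs in a document or a set of documents, which gives a way to find documents by their content. Because it does not start from a document to determine the content it contains, but works in the opposite direction, it is called an inverted index.

Example description:

An inverted index usually consists of a word (or phrase) together with a list of related documents. Each entry in the list is either an ID that identifies the document or the URL where the document is located.
In practice, each document also needs a weight that indicates how relevant it is to the search term. In this example, the weight is the number of times the word occurs in the file.

Sample input:

1) file1:
MapReduce is simple
2) file2:
MapReduce is powerful is simple
3) file3:
Hello MapReduce bye MapReduce

Expected output: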

MapReduce   file1.txt:1;file2.txt:1;file3.txt:2;
is          file1.txt:1;file2.txt:2;
simple      file1.txt:1;file2.txt:1;
powerful    file2.txt:1;
Hello       file3.txt:1;
bye         file3.txt:1;
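As a reference point, the same index can be built in memory with plain Java. This is only a minimal sketch for checking the expected output on small inputs: the class `InvertedIndexSketch` and its methods `build` and `format` are hypothetical names introduced here, and the file names with the `.txt` suffix are assumed from the expected output above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the original MapReduce job.
class InvertedIndexSketch {

    // Builds word -> (file name -> occurrence count), keeping first-seen order.
    static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new LinkedHashMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // an empty document yields one empty token
                }
                index.computeIfAbsent(word, w -> new LinkedHashMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    // Formats one posting list as "file:count;file:count;".
    static String format(Map<String, Integer> postings) {
        StringBuilder sb = new StringBuilder();
        postings.forEach((file, count) ->
                sb.append(file).append(':').append(count).append(';'));
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("file1.txt", "MapReduce is simple");
        docs.put("file2.txt", "MapReduce is powerful is simple");
        docs.put("file3.txt", "Hello MapReduce bye MapReduce");
        build(docs).forEach((word, postings) ->
                System.out.println(word + "\t" + format(postings)));
    }
}
```

The nested map is the whole data structure: the outer key is the word, and the inner map is its document list with the occurrence count acting as the weight. The MapReduce job produces the same pairs, only distributed across map and reduce tasks.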

package mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CopyOfSST {
    static String INPUT_PATH = "hdfs://master:9000/show1/";
    static String OUTPUT_PATH = "hdfs://master:9000/output";

    static class MyMapper extends Mapper<Object, Text, Text, Text> {
        Text output_key = new Text();
        Text output_value = new Text("1");
        String tableName = "";

        // Record the name of the file that this input split belongs to.
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            FileSplit fs = (FileSplit) context.getInputSplit();
            tableName = fs.getPath().getName();
        }

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split on any whitespace and skip empty tokens.
            for (String word : value.toString().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                output_key.set(word + ":" + tableName);
                context.write(output_key, output_value); // (word:tableName, 1)
            }
        }
    }

    // Caveat: Hadoop may run a combiner zero or more times, so rewriting the key
    // here only gives the intended result when it runs exactly once per map output.
    static class MyCombiner extends Reducer<Text, Text, Text, Text> {
        Text outputvalue = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (Text c : values) {
                count += Integer.parseInt(c.toString());
            } // occurrences of this word in this file

            // Split at the last ':' so a word that itself contains ':' stays intact.
            String composite = key.toString();
            int sep = composite.lastIndexOf(':');
            key.set(composite.substring(0, sep));
            outputvalue.set(composite.substring(sep + 1) + ":" + count);

            context.write(key, outputvalue); // (word, tableName:count)
        }
    }

    static class MyReduce extends Reducer<Text, Text, Text, Text> {
        Text outputvalue = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder filelist = new StringBuilder();
            for (Text value : values) {
                filelist.append(value.toString()).append(";");
            }
            outputvalue.set(filelist.toString());
            context.write(key, outputvalue); // (word, tableName:count;tableName:count;)
        }
    }

    public static void main(String[] args) throws Exception {
        Path outputpath = new Path(OUTPUT_PATH);
        Configuration conf = new Configuration();

        // Delete the output directory if it already exists, otherwise the job fails.
        FileSystem file = outputpath.getFileSystem(conf);
        if (file.exists(outputpath)) {
            file.delete(outputpath, true);
        }

        Job job = Job.getInstance(conf);
        job.setJarByClass(CopyOfSST.class);

        FileInputFormat.setInputPaths(job, INPUT_PATH);
        FileOutputFormat.setOutputPath(job, outputpath);

        job.setMapperClass(MyMapper.class);
        job.setCombinerClass(MyCombiner.class);
        job.setReducerClass(MyReduce.class);
        // Records are partitioned by the map key (word:tableName) before the combiner
        // rewrites it, so a single reducer keeps all files of one word together.
        job.setNumReduceTasks(1);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

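Map phase:

Identify which file each word comes from, use the word together with the file name as the key, and set the value to 1 in preparation for the word count that follows.

context.write(word:tableName,1);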
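The map step can be tried outside Hadoop with a small plain-Java simulation. This is a sketch under assumptions: `MapStepSketch` and its `map` method are hypothetical names, and a `String[]` of length two stands in for the key/value pair that `context.write` would emit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the mapper; it only mimics the emitted records.
class MapStepSketch {

    // Emits one {"word:fileName", "1"} pair per whitespace-separated token.
    static List<String[]> map(String fileName, String line) {
        List<String[]> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // blank lines produce a single empty token
            }
            out.add(new String[] { word + ":" + fileName, "1" });
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] kv : map("file2.txt", "MapReduce is powerful is simple")) {
            System.out.println(kv[0] + "\t" + kv[1]);
        }
    }
}
```

Note that a repeated word such as "is" is emitted twice with the same key. Nothing is summed at this stage; the counting happens in the next step.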

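With the key laid out as word:tableName, the reduce phase cannot both count the words and build the file list, so a combiner phase is added. Note that Hadoop treats a combiner as an optional optimization and may run it zero or more times, so this design depends on the combiner running exactly once for each map output.

Combiner phase:

Performs the word count.

context.write(word,tableName:count);

This yields the number of occurrences of each word in each file. The combiner output is then merged during the shuffle to the reducer into a value list per word, which holds the word's counts across all files.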
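The key rewrite performed in the combiner can be illustrated in isolation. The sketch below is plain Java, not the Hadoop API: `CombineStepSketch` and `combine` are hypothetical names, and the composite key is split at the last colon on the assumption that file names contain no colon.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the combiner's reduce method.
class CombineStepSketch {

    // Turns ("word:fileName", ["1", "1", ...]) into {"word", "fileName:count"}.
    static String[] combine(String compositeKey, List<String> ones) {
        int count = 0;
        for (String v : ones) {
            count += Integer.parseInt(v);
        }
        int sep = compositeKey.lastIndexOf(':');
        String word = compositeKey.substring(0, sep);
        String fileName = compositeKey.substring(sep + 1);
        return new String[] { word, fileName + ":" + count };
    }

    public static void main(String[] args) {
        String[] kv = combine("is:file2.txt", Arrays.asList("1", "1"));
        System.out.println(kv[0] + "\t" + kv[1]);
    }
}
```

Feeding the output of this step back into it would fail, because the rewritten key no longer contains a file name and the value is no longer a plain number. That is the reason the design relies on the combiner running exactly once.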

Reduce phase:

Builds the file list.

list+=value.toString()+";" // the word's per-file counts are separated by semicolons

context.write(word,list);
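The reduce step above simply concatenates the values it receives. A more defensive variant, sketched here in plain Java, also sums the counts per file, so the totals stay correct even if partial counts for the same file arrive separately. This is an assumption-laden alternative, not the job described in this article: `ReduceStepSketch` and `reduce` are hypothetical names, and it presumes values of the form fileName:count under a key that is just the word.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a reducer that tolerates partial counts.
class ReduceStepSketch {

    // Sums counts per file, then joins them as "file:count;file:count;".
    static String reduce(Iterable<String> values) {
        Map<String, Integer> perFile = new TreeMap<>(); // sorted by file name
        for (String v : values) {
            int sep = v.lastIndexOf(':');
            String fileName = v.substring(0, sep);
            int count = Integer.parseInt(v.substring(sep + 1));
            perFile.merge(fileName, count, Integer::sum);
        }
        StringBuilder list = new StringBuilder();
        for (Map.Entry<String, Integer> e : perFile.entrySet()) {
            list.append(e.getKey()).append(':').append(e.getValue()).append(';');
        }
        return list.toString();
    }

    public static void main(String[] args) {
        // Two partial counts for file3.txt are merged into one entry.
        System.out.println(reduce(Arrays.asList("file3.txt:1", "file1.txt:1", "file3.txt:1")));
    }
}
```

Because the key never changes and summing is associative, the same logic could serve as both combiner and reducer in such a design, at the cost of the mapper emitting fileName:1 as the value instead of a bare 1.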
