Hadoop MapReduce: Inverted Index

贾高兴

Category: Big Data. Tags: big data, hadoop, mapreduce

Article link: https://blog.csdn.net/qq_41135504/article/details/108776333

1. Introduction to the inverted index

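The inverted index is one of the most commonly used data structures in document retrieval systems and is widely used in full-text search engines. It stores a mapping from each word (or phrase) to the places where it occurs in a set of documents, so documents are looked up by their content rather than content being looked up by document. That reversal is why it is called an inverted index. A file that carries an inverted index is called an inverted index file, or inverted file for short.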
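To make the mapping concrete, here is a minimal in-memory sketch in plain Java with no Hadoop involved. The class name SimpleInvertedIndex and the sample document names are made up for illustration and are not part of the project built later in this article.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Minimal in-memory inverted index: word -> sorted set of document names. */
final class SimpleInvertedIndex {
    private final Map<String, Set<String>> postings = new TreeMap<>();

    /** Splits the text on whitespace and records that each word occurs in docName. */
    void add(String docName, String text) {
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, w -> new TreeSet<>()).add(docName);
            }
        }
    }

    /** Returns the documents that contain the word, or an empty set if there are none. */
    Set<String> lookup(String word) {
        return postings.getOrDefault(word, new TreeSet<>());
    }

    public static void main(String[] args) {
        SimpleInvertedIndex index = new SimpleInvertedIndex();
        index.add("doc1.txt", "hadoop mapreduce"); // hypothetical sample documents
        index.add("doc2.txt", "hadoop hdfs");
        System.out.println(index.lookup("hadoop"));
        System.out.println(index.lookup("hdfs"));
    }
}
```

The lookup goes from a word to the documents that contain it, which is the reverse of reading a document to find its words.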

2. Case requirements and analysis

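1. Suppose there are three source files, file1.txt, file2.txt, and file3.txt. The task is to build an inverted index over the contents of these three files and write out the resulting inverted index file.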

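2. After the Map stage transforms the data, the same word can appear several times for a single document. In this design the Reduce stage alone cannot both count word frequencies and build the document list, so a Combine stage is added to compute the word frequencies of each document first. Note that Hadoop treats a combiner as an optional optimization and does not guarantee how many times it runs, so this design is best read as a teaching example.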

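3. After these two stages, the Reduce stage only needs to collect the values that share the same key across all files and assemble them into the format required by the inverted index file.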
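The data flow through the three stages can be traced with a small plain-Java simulation. This is only a sketch of the idea, not Hadoop code. The class StageFlowSketch, its method names, and the contents of the three files are assumptions made for illustration, since the article does not show the actual file contents.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java walk-through of the Map, Combine and Reduce stages (no Hadoop classes). */
final class StageFlowSketch {

    /** Map stage: emit one ("word:file", "1") pair per word occurrence. */
    static List<String[]> mapStage(String fileName, String content) {
        List<String[]> out = new ArrayList<>();
        for (String word : content.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new String[] {word + ":" + fileName, "1"});
            }
        }
        return out;
    }

    /** Combine stage: sum the counts per "word:file" key, then re-key as word -> "file:count". */
    static List<String[]> combineStage(List<String[]> mapped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] kv : mapped) {
            counts.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int i = e.getKey().lastIndexOf(':');
            String word = e.getKey().substring(0, i);
            String file = e.getKey().substring(i + 1);
            out.add(new String[] {word, file + ":" + e.getValue()});
        }
        return out;
    }

    /** Reduce stage: concatenate all "file:count" values that share the same word. */
    static Map<String, String> reduceStage(List<String[]> combined) {
        Map<String, String> index = new TreeMap<>();
        for (String[] kv : combined) {
            index.merge(kv[0], kv[1] + ";", String::concat);
        }
        return index;
    }

    /** Runs map and combine per file, then a single reduce over everything. */
    static Map<String, String> buildIndex(Map<String, String> files) {
        List<String[]> combined = new ArrayList<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            combined.addAll(combineStage(mapStage(file.getKey(), file.getValue())));
        }
        return reduceStage(combined);
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("file1.txt", "MapReduce is simple"); // assumed sample contents
        files.put("file2.txt", "MapReduce is powerful is simple");
        files.put("file3.txt", "Hello MapReduce bye MapReduce");
        buildIndex(files).forEach((word, postings) -> System.out.println(word + "\t" + postings));
    }
}
```

In this simulation the combine step always runs exactly once per file, which is what the design assumes. A real cluster makes no such promise.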

3. Implementing the Map, Combine, and Reduce stages

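Create a Maven project named invertedIndex and write a custom Mapper class, InvertedIndexMapper, in it. The mapper splits each line of text into words on spaces and joins each word with its document name using a colon. The resulting "word:document name" string is the key, the occurrence count 1 is the value, and both are emitted as text to the Combine stage.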

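package invertedIndex;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {

	private static Text keyInfo = new Text();// holds the "word:fileName" key
	private static final Text valueInfo = new Text("1");// count for one occurrence, always 1

	@Override
	protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		String line = value.toString();
		String[] fields = line.trim().split("\\s+");// split the line into words on whitespace
		FileSplit fileSplit = (FileSplit) context.getInputSplit();// the input split this line belongs to
		String fileName = fileSplit.getPath().getName();// file name taken from the split
		for (String field : fields) {
			if (field.isEmpty()) {
				continue;// a blank line yields one empty token; skip it
			}
			// the key combines the word and the file name, e.g. "MapReduce:file1.txt"
			keyInfo.set(field + ":" + fileName);
			context.write(keyInfo, valueInfo);
		}
	}
}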

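Based on the form of the Map stage output, define the class InvertedIndexCombiner in the invertedIndex package to implement the Combine stage, which counts how often each word occurs in each document.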

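package invertedIndex;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {

	private static Text info = new Text();// holds the "fileName:count" value

	// input:  <MapReduce:file3.txt, {1, 1}>
	// output: <MapReduce, file3.txt:2>
	@Override
	protected void reduce(Text key, Iterable<Text> values, Context context)
			throws IOException, InterruptedException {
		// 1. local aggregation: sum the counts of this word in this file
		int count = 0;
		for (Text v : values) {
			count += Integer.parseInt(v.toString());
		}
		// 2. move the file name out of the key and into the value
		String composite = key.toString();
		int splitIndex = composite.lastIndexOf(":");
		info.set(composite.substring(splitIndex + 1) + ":" + count);
		key.set(composite.substring(0, splitIndex));
		context.write(key, info);
	}
}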
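The Combine stage has to take apart the "word:document name" key produced by the Map stage. The following plain-Java sketch isolates that step so it can be tried without a cluster. The helper class CompositeKey is hypothetical and not part of the project. It splits on the last colon, on the assumption that file names contain no colon while words might.

```java
/** Hypothetical helper that splits a "word:fileName" key on its last colon. */
final class CompositeKey {
    final String word;
    final String fileName;

    private CompositeKey(String word, String fileName) {
        this.word = word;
        this.fileName = fileName;
    }

    /** Parses a key such as "MapReduce:file1.txt"; rejects keys without a colon. */
    static CompositeKey parse(String key) {
        int i = key.lastIndexOf(':');
        if (i < 0) {
            throw new IllegalArgumentException("not a word:fileName key: " + key);
        }
        return new CompositeKey(key.substring(0, i), key.substring(i + 1));
    }

    /** Builds the "fileName:count" value that is emitted under the word key. */
    String toPosting(int count) {
        return fileName + ":" + count;
    }

    public static void main(String[] args) {
        CompositeKey k = CompositeKey.parse("MapReduce:file3.txt");
        System.out.println(k.word + " -> " + k.toPosting(2));
    }
}
```

Splitting on the last colon keeps a token such as 12:30 intact as the word, whereas splitting on the first colon would cut it in half.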

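Based on the form of the Combine stage output, define the Reducer class InvertedIndexReducer in the same invertedIndex package. It receives the data emitted by the Combine stage and shapes it into the format the inverted index file requires: the word is the key, the document names joined with their word frequencies form the value, and the result is written to the target directory.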

package invertedIndex;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {

    private static Text result = new Text();

    // input:  <MapReduce file3:2>  <MapReduce file1:1> <MapReduce file2:1>
    // output: <MapReduce file1:1;file2:1;file3:2;> (the order of the entries is not guaranteed)
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // build the document list
        StringBuilder fileList = new StringBuilder();
        for (Text value : values) {
            fileList.append(value.toString()).append(";");
        }

        result.set(fileList.toString());
        context.write(key, result);
    }
}
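Hadoop does not sort the values that arrive in a single reduce call, so the entries in the document list can come out in any order. If a stable order such as file1, file2, file3 is wanted, the values can be copied and sorted before they are joined. The sketch below shows that step in plain Java. PostingListFormatter is a hypothetical helper, not part of the project, and in a real reducer each Text value would first be converted with toString() because Hadoop reuses the Text object between iterations.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: joins "file:count" entries in sorted order, each followed by ';'. */
final class PostingListFormatter {

    static String format(Iterable<String> postings) {
        // Copy first, because the incoming iterable may only be traversed once.
        List<String> sorted = new ArrayList<>();
        for (String posting : postings) {
            sorted.add(posting);
        }
        Collections.sort(sorted); // plain lexicographic order

        StringBuilder sb = new StringBuilder();
        for (String posting : sorted) {
            sb.append(posting).append(';');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> values = new ArrayList<>();
        values.add("file3:2");
        values.add("file1:1");
        values.add("file2:1");
        System.out.println(format(values));
    }
}
```

Lexicographic sorting places file10 before file2, so a collection with many numbered files would need a different comparator.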

Write the main class InvertedIndexRunner that runs the MapReduce program. Its job is to set the parameters of the MapReduce job.

package invertedIndex;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndexRunner {  
    public static void main(String[] args) throws IOException,  
            ClassNotFoundException, InterruptedException {  
        Configuration conf = new Configuration();  
        Job job = Job.getInstance(conf);  
  
        job.setJarByClass(InvertedIndexRunner.class);  
  
        job.setMapperClass(InvertedIndexMapper.class);  
        job.setCombinerClass(InvertedIndexCombiner.class);  
        job.setReducerClass(InvertedIndexReducer.class);  
  
        job.setOutputKeyClass(Text.class);  
        job.setOutputValueClass(Text.class);  
  
        FileInputFormat.setInputPaths(job, "hdfs://192.168.113.134:9000/invertedIndex/input/");
        // where the results are saved once processing finishes; replace the IP after hdfs:// with your own host
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.113.134:9000/invertedIndex/output/"));
        
        // submit this job to the YARN cluster
        boolean res = job.waitForCompletion(true);
        
        System.exit(res ? 0 : 1);
    }  
}

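Run the program through its entry point, the InvertedIndexRunner class. After it completes successfully, the result files are generated under the output path configured in the code, hdfs://192.168.113.134:9000/invertedIndex/output/.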

Add the dependencies to pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.jiagaoxing</groupId>
  <artifactId>mr-0923</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  
  
  
 
  <dependencies>
  	<dependency>
  		<groupId>org.apache.hadoop</groupId>
  		<artifactId>hadoop-common</artifactId>
  		<version>2.7.4</version>
  	</dependency>
  	<dependency>
  		<groupId>org.apache.hadoop</groupId>
  		<artifactId>hadoop-hdfs</artifactId>
  		<version>2.7.4</version>
  	</dependency>
  	<dependency>
  		<groupId>org.apache.hadoop</groupId>
  		<artifactId>hadoop-client</artifactId>
  		<version>2.7.4</version>
  	</dependency>
  	<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-mapreduce-client-core</artifactId>
			<version>2.7.4</version>
		</dependency>
  	<dependency>
  		<groupId>junit</groupId>
  		<artifactId>junit</artifactId>
  		<version>4.12</version>
  	</dependency>
  </dependencies>
  <build>
  		<!-- name of the packaged jar -->
		<finalName>hadoop-mr-inverted-index</finalName>
		<plugins>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-jar-plugin</artifactId>
				<version>2.4</version>
				<configuration>
					<archive>
						<manifest>
							<addClasspath>true</addClasspath>
							<classpathPrefix>lib/</classpathPrefix>
							<mainClass>invertedIndex.InvertedIndexRunner</mainClass>
						</manifest>
					</archive>
				</configuration>
			</plugin>
			<plugin>
				<groupId>org.apache.maven.plugins</groupId>
				<artifactId>maven-compiler-plugin</artifactId>
				<version>3.0</version>
				<configuration>
					<source>1.8</source>
					<target>1.8</target>
					<encoding>UTF-8</encoding>
				</configuration>
			</plugin>
		</plugins>
	</build>
</project>

4. Packaging and running the program

1. Package the program.
2. Copy the packaged jar to any server in the cluster.
3. Upload file1.txt, file2.txt, and file3.txt to the /home/hadoop/ directory on the cluster server hadoop01.
4. Prepare the input data and run the job.
  1. On Linux, change to /home/hadoop/apps/hadoop-2.7.4 (this is my Hadoop installation directory; use your own) and start the Hadoop services.
    - cd sbin
    - ./start-all.sh
  2. Create the input directory on HDFS.
    - hadoop fs -mkdir -p /invertedIndex/input
  3. Upload file1.txt, file2.txt, and file3.txt to HDFS.
    - hadoop fs -put /home/hadoop/file1.txt /invertedIndex/input
    - hadoop fs -put /home/hadoop/file2.txt /invertedIndex/input
    - hadoop fs -put /home/hadoop/file3.txt /invertedIndex/input
  4. Run the packaged jar.
    - hadoop jar hadoop-mr-inverted-index.jar invertedIndex.InvertedIndexRunner /invertedIndex/input
  5. View the results.
    - hadoop fs -cat /invertedIndex/output/part-r-00000
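Each line of part-r-00000 holds a word, a tab, and the document list. The tab is the default key/value separator of TextOutputFormat, which the job uses because no other output format is set. To check or post-process the result, a line can be parsed back into per-file counts. The sketch below is a hypothetical plain-Java helper named IndexLineParser, and the sample line is made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that parses one output line: word<TAB>file:count;file:count; */
final class IndexLineParser {

    /** Returns the word in front of the tab. */
    static String parseWord(String line) {
        return line.substring(0, requireTab(line));
    }

    /** Returns file name -> count, in the order the entries appear on the line. */
    static Map<String, Integer> parsePostings(String line) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String posting : line.substring(requireTab(line) + 1).split(";")) {
            if (posting.isEmpty()) {
                continue; // an empty document list yields one empty token
            }
            int i = posting.lastIndexOf(':');
            result.put(posting.substring(0, i), Integer.parseInt(posting.substring(i + 1)));
        }
        return result;
    }

    private static int requireTab(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("no tab separator in line: " + line);
        }
        return tab;
    }

    public static void main(String[] args) {
        String line = "MapReduce\tfile1.txt:1;file3.txt:2;"; // made-up sample line
        System.out.println(parseWord(line) + " " + parsePostings(line));
    }
}
```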

5. Troubleshooting common problems

Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://192.168.113.134:9000/invertedIndex/output already exists

The invertedIndex/output directory already exists from a previous run and has to be deleted manually before the job is run again:
- hadoop fs -rm -r /invertedIndex/output
