Hadoop: How MapReduce Works and How to Build an Inverted Index

2401_84905102

Published 2024-05-12 23:12:21


Article link: https://blog.csdn.net/2401_84905102/article/details/138771533


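            throws IOException, InterruptedException {
        // By default each map value is one line of input; split it into words on whitespace
        String[] vs = value.toString().split("\\s+");
        for (String v : vs) {
            if (v.isEmpty()) {
                continue; // skip the empty token produced by blank lines or leading whitespace
            }
            // Emit (word, 1)
            context.write(new Text(v), ONE);
        }

    }
}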

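The Reduce phase
The Reduce phase requires extending the Reducer class from the org.apache.hadoop.mapreduce package and overriding its reduce method. The reduce method receives <key, values> pairs in which key is a single word and values is the list of counts that the Map phase emitted for that word. Because the output of Map is the input of Reduce, the reduce method only has to iterate over values and sum them to get the total number of occurrences of a word.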

// Reduce phase
/**
 * Text, IntWritable: input types, taken from the Map phase (the map output is the reduce input)
 * Text, IntWritable: output types
 */
public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable v : values) {
            count += v.get(); // add this value to the running total for the word
        }

        context.write(key, new IntWritable(count));
    }

}

Finally, configure and run the MapReduce job

public static void main(String[] args) {

    Configuration conf = new Configuration();
    try {
        // Parse the generic Hadoop options; the remaining arguments are the input and output paths
        String[] paths = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (paths.length < 2) {
            throw new RuntimeException("Both an input path and an output path are required");
        }
        // Create a Job and give it a name
        Job job = Job.getInstance(conf, "wordcount");
        // Set the jar so that this program can run on Hadoop
        job.setJarByClass(WordCount.class);
        // Set the Mapper class
        job.setMapperClass(WordCountMapper.class);
        // Set the map output types explicitly, because they differ from the map input types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Set the Reducer class and the final output types
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Set the input and output directories
        FileInputFormat.addInputPath(job, new Path(paths[0]));
        FileOutputFormat.setOutputPath(job, new Path(paths[1]));
        // Submit the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    } catch (IOException e) {
        e.printStackTrace();
    } catch (ClassNotFoundException e) {
        e.printStackTrace();
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}

This yields the number of occurrences of each word.
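To see the data flow without a cluster, the three steps can be imitated in plain Java. This is only a rough sketch under my own assumptions: the class `WordCountSim`, its method names and the two sample lines are made up for illustration, and no Hadoop classes are involved. Map emits (word, 1), the shuffle groups the values by key, and reduce sums each list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of the word-count data flow (hypothetical helper, not Hadoop code).
class WordCountSim {

    // "Map": split one line on whitespace and emit one (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(Map.entry(token, 1));
            }
        }
        return out;
    }

    // "Shuffle": group the emitted values by key, with keys kept in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // "Reduce": sum the list of counts collected for one word.
    static int reduce(List<Integer> values) {
        int count = 0;
        for (int v : values) {
            count += v;
        }
        return count;
    }

    // Runs map, shuffle and reduce in sequence over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> result.put(word, reduce(ones)));
        return result;
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch
        System.out.println(wordCount(List.of("hello world", "hello hadoop")));
    }
}
```

In the real job the framework performs the grouping step between map and reduce, which is why the Reducer only contains the summing loop.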

The complete source is listed below for anyone who wants to test it.

package hadoopday02;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {
    // Constant count of 1 emitted for every word occurrence
    private static final IntWritable ONE = new IntWritable(1);

    // Map phase
    /**
     * @author 汤高
     * In Mapper<LongWritable, Text, Text, IntWritable>, LongWritable and IntWritable are the Hadoop
     * data types for long and int values.
     *
     * LongWritable, Text are the input types (for word count: the byte offset at which the line
     * starts, and the text of the line).
     * Text, IntWritable are the output types: a word and its count.
     * Note: the map input types (LongWritable key, Text value) differ from the output types,
     * so the job configuration must set the output types explicitly.
     */
    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
                throws IOException, InterruptedException {
            // By default each map value is one line of input; split it into words on whitespace
            String[] vs = value.toString().split("\\s+");
            for (String v : vs) {
                if (v.isEmpty()) {
                    continue; // skip the empty token produced by blank lines or leading whitespace
                }
                // Emit (word, 1)
                context.write(new Text(v), ONE);
            }
        }
    }

    // Reduce phase
    /**
     * Text, IntWritable: input types, taken from the Map phase (the map output is the reduce input)
     * Text, IntWritable: output types
     */
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable v : values) {
                count += v.get(); // add this value to the running total for the word
            }

            context.write(key, new IntWritable(count));
        }
    }

    public static void main(String[] args) {

        Configuration conf = new Configuration();
        try {
            // Parse the generic Hadoop options; the remaining arguments are the input and output paths
            String[] paths = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (paths.length < 2) {
                throw new RuntimeException("Both an input path and an output path are required");
            }
            // Create a Job and give it a name
            Job job = Job.getInstance(conf, "wordcount");
            // Set the jar so that this program can run on Hadoop
            job.setJarByClass(WordCount.class);
            // Set the Mapper class
            job.setMapperClass(WordCountMapper.class);
            // Set the map output types explicitly, because they differ from the map input types
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
            // Set the Reducer class and the final output types
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Set the input and output directories
            FileInputFormat.addInputPath(job, new Path(paths[0]));
            FileOutputFormat.setOutputPath(job, new Path(paths[1]));
            // Submit the job and wait for it to finish
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ClassNotFoundException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

}


 


## 2. Building an inverted index with Hadoop


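An inverted index looks up documents by the words they contain. Instead of going from a document to the content it holds, it goes the opposite way, from content to documents, hence the name. It is the core data structure of a search engine and a key part of document retrieval.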


Here is an example that shows what an inverted index is.


I prepared two files, 1.txt and 2.txt.


1.txt contains:

I Love Hadoop
I like ZhouSiYuan
I love me




---


 


2.txt contains:

I Love MapReduce
I like NBA
I love Hadoop
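Before bringing in MapReduce, it may help to see the target data structure built directly in memory from these two files. The following is a minimal sketch, assuming the file contents are already available as strings; `InMemoryInvertedIndex` and `build` are names I made up for illustration. It maps each word to the documents that contain it, together with the number of occurrences in each document.

```java
import java.util.Map;
import java.util.TreeMap;

// In-memory inverted index: word -> (document name -> term frequency). Hypothetical helper.
class InMemoryInvertedIndex {

    static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // Splitting on whitespace also splits across line breaks
            for (String word : doc.getValue().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                // Increment the count of this word for this document
                index.computeIfAbsent(word, k -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = Map.of(
                "1.txt", "I Love Hadoop\nI like ZhouSiYuan\nI love me",
                "2.txt", "I Love MapReduce\nI like NBA\nI love Hadoop");
        Map<String, Map<String, Integer>> index = build(docs);
        // Look up which documents contain a word, and how often
        System.out.println("Hadoop -> " + index.get("Hadoop"));
        System.out.println("NBA -> " + index.get("NBA"));
    }
}
```

The lookup goes from a word to its documents, which is the direction the name "inverted" refers to. The MapReduce version produces the same information, only distributed over several phases.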


 


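I use the default input format, TextInputFormat, which reads the input line by line. The key of each record is the byte offset at which the line starts in the file, and the value is the text of the line.  
  


So the records handed to the map phase look as follows (the offsets imply that the files use two-byte \r\n line endings).   
 Input to the map phase from 1.txt: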

0 I Love Hadoop
15 I like ZhouSiYuan
34 I love me


Input to the map phase from 2.txt:

0 I Love MapReduce
18 I like NBA
30 I love Hadoop
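The offsets above can be checked with a few lines of Java. This is a small sketch under one assumption that the listing suggests but the post does not state: each line ends with the two bytes `\r\n`. The class `LineOffsets` and its method are hypothetical names used only here.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Computes the starting byte offset of every line, the kind of key TextInputFormat hands to map.
class LineOffsets {

    // lineSeparator is an assumption about the file, for example "\r\n" or "\n".
    static List<Long> offsets(String content, String lineSeparator) {
        List<Long> result = new ArrayList<>();
        long separatorBytes = lineSeparator.getBytes(StandardCharsets.UTF_8).length;
        long offset = 0;
        for (String line : content.split(Pattern.quote(lineSeparator))) {
            result.add(offset);
            // The next line starts after this line's bytes plus the separator bytes
            offset += line.getBytes(StandardCharsets.UTF_8).length + separatorBytes;
        }
        return result;
    }

    public static void main(String[] args) {
        String file1 = "I Love Hadoop\r\nI like ZhouSiYuan\r\nI love me\r\n";
        String file2 = "I Love MapReduce\r\nI like NBA\r\nI love Hadoop\r\n";
        System.out.println(offsets(file1, "\r\n"));
        System.out.println(offsets(file2, "\r\n"));
    }
}
```

"I Love Hadoop" is 13 bytes, so with a two-byte line ending the second line of 1.txt starts at offset 15, and the third at 15 + 17 + 2 = 34. With a one-byte `\n` ending the same file would give 0, 14 and 32 instead.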


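Map phase   
 The value is the term frequency.   
 The key is the word combined with the URI of the file.   
 For example:   
 key: I:hdfs://192.168.52.140:9000/index/2.txt value: 1


Why set up the key and value this way?   
 Because this design lets the MapReduce framework's built-in map-side sort collect the counts of the same word in the same file into one list.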
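The key design described above can be tried out without a cluster. Below is a minimal plain-Java sketch, not the Hadoop Mapper itself: `IndexMapSim` and its `map` method are hypothetical, and the file URI is passed in as an argument. In a real Mapper the path is typically taken from the input split instead (for file input, `FileSplit` offers `getPath()`).

```java
import java.util.ArrayList;
import java.util.List;

// Imitates the map phase of the inverted index: emit ("word:uri", "1") for every token.
class IndexMapSim {

    // Each result element is a two-element array: [0] = composite key, [1] = value.
    static List<String[]> map(String uri, String line) {
        List<String[]> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            // Composite key: the word and the file URI joined by a colon
            out.add(new String[] { word + ":" + uri, "1" });
        }
        return out;
    }

    public static void main(String[] args) {
        String uri = "hdfs://192.168.52.140:9000/index/2.txt";
        for (String[] kv : map(uri, "I Love MapReduce")) {
            System.out.println(kv[0] + " " + kv[1]);
        }
    }
}
```

Because the URI itself contains colons, any code that later splits the composite key has to cut at the first colon only. That works as long as the words themselves contain no colon.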


The map phase produces the following output for 1.txt:

I:hdfs://192.168.52.140:9000/index/1.txt 1
Love:hdfs://192.168.52.140:9000/index/1.txt 1
Hadoop:hdfs://192.168.52.140:9000/index/1.txt 1
I:hdfs://192.168.52.140:9000/index/1.txt 1
like:hdfs://192.168.52.140:9000/index/1.txt 1
ZhouSiYuan:hdfs://192.168.52.140:9000/index/1.txt 1
I:hdfs://192.168.52.140:9000/index/1.txt 1
love:hdfs://192.168.52.140:9000/index/1.txt 1
me:hdfs://192.168.52.140:9000/index/1.txt 1


The map phase produces the following output for 2.txt:

I:hdfs://192.168.52.140:9000/index/2.txt 1
Love:hdfs://192.168.52.140:9000/index/2.txt 1
MapReduce:hdfs://192.168.52.140:9000/index/2.txt 1
I:hdfs://192.168.52.140:9000/index/2.txt 1
like:hdfs://192.168.52.140:9000/index/2.txt 1
NBA:hdfs://192.168.52.140:9000/index/2.txt 1
I:hdfs://192.168.52.140:9000/index/2.txt 1
love:hdfs://192.168.52.140:9000/index/2.txt 1
Hadoop:hdfs://192.168.52.140:9000/index/2.txt 1


After the MapReduce framework's built-in map-side sort, the records of 1.txt are grouped by key (listed here in order of first appearance rather than in sorted key order):

I:hdfs://192.168.52.140:9000/index/1.txt list{1,1,1}
Love:hdfs://192.168.52.140:9000/index/1.txt list{1}
Hadoop:hdfs://192.168.52.140:9000/index/1.txt list{1}
like:hdfs://192.168.52.140:9000/index/1.txt list{1}
ZhouSiYuan:hdfs://192.168.52.140:9000/index/1.txt list{1}
love:hdfs://192.168.52.140:9000/index/1.txt list{1}
me:hdfs://192.168.52.140:9000/index/1.txt list{1}


After the MapReduce framework's built-in map-side sort, the records of 2.txt are grouped by key (again listed in order of first appearance):

I:hdfs://192.168.52.140:9000/index/2.txt list{1,1,1}
Love:hdfs://192.168.52.140:9000/index/2.txt list{1}
MapReduce:hdfs://192.168.52.140:9000/index/2.txt list{1}
like:hdfs://192.168.52.140:9000/index/2.txt list{1}
NBA:hdfs://192.168.52.140:9000/index/2.txt list{1}
love:hdfs://192.168.52.140:9000/index/2.txt list{1}
Hadoop:hdfs://192.168.52.140:9000/index/2.txt list{1}


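Combine phase:   
 The key is the word.   
 The value is the URI together with the term frequency.   
 For example, key: I value: hdfs://192.168.52.140:9000/index/2.txt:3   
 Why design the key and value this way?   
 Because the shuffle has to solve one problem: all records for the same word (each made up of the word, the URI and the term frequency) must be handled by the same Reducer.   
 Making the word alone the key again lets the framework's default shuffle send all records for a word to the same Reducer.


The combine phase sums the values that share a key.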


For 1.txt this gives the following output:

I hdfs://192.168.52.140:9000/index/1.txt:3
Love hdfs://192.168.52.140:9000/index/1.txt:1
Hadoop hdfs://192.168.52.140:9000/index/1.txt:1
like hdfs://192.168.52.140:9000/index/1.txt:1
ZhouSiYuan hdfs://192.168.52.140:9000/index/1.txt:1
love hdfs://192.168.52.140:9000/index/1.txt:1
me hdfs://192.168.52.140:9000/index/1.txt:1


For 2.txt this gives the following output:

I hdfs://192.168.52.140:9000/index/2.txt:3
Love hdfs://192.168.52.140:9000/index/2.txt:1
MapReduce hdfs://192.168.52.140:9000/index/2.txt:1
like hdfs://192.168.52.140:9000/index/2.txt:1
NBA hdfs://192.168.52.140:9000/index/2.txt:1
love hdfs://192.168.52.140:9000/index/2.txt:1
Hadoop hdfs://192.168.52.140:9000/index/2.txt:1
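The combine step described above does two things to each grouped record: it sums the counts and it moves the URI from the key into the value. Here is a plain-Java sketch of that transformation, with `IndexCombineSim` and `combine` as hypothetical names of my own. One caveat worth keeping in mind: Hadoop treats a combiner as an optional optimization that may run zero, one or several times per map task, so a real job that relies on the combiner to change the key is depending on behaviour the framework does not guarantee.

```java
import java.util.List;
import java.util.Map;

// Imitates the combine step: ("word:uri", [1, 1, ...]) becomes ("word", "uri:sum").
class IndexCombineSim {

    static Map.Entry<String, String> combine(String compositeKey, List<Integer> counts) {
        // Sum the term frequency of this word in this file
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        // Cut at the FIRST colon only, because the URI contains colons as well
        int sep = compositeKey.indexOf(':');
        String word = compositeKey.substring(0, sep);
        String uri = compositeKey.substring(sep + 1);
        // New key: the word alone; new value: URI and frequency
        return Map.entry(word, uri + ":" + sum);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> record =
                combine("I:hdfs://192.168.52.140:9000/index/1.txt", List.of(1, 1, 1));
        System.out.println(record.getKey() + " " + record.getValue());
    }
}
```

A more defensive design keeps the composite key through the combiner and does the re-keying elsewhere, for example with a custom partitioner on the word part of the key or in a second job. That is a variation on the approach in this post, not what the post itself does.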


That leaves very little for the reducer to do: it only builds the document list.   
 For the word I, for example, it produces this document list:   
 I hdfs://192.168.52.140:9000/index/2.txt:3;hdfs://192.168.52.140:9000/index/1.txt:3;
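The reducer's job of building the document list can be sketched in plain Java as well. `IndexReduceSim` and `reduce` are hypothetical names, and the input is assumed to be the (word, "uri:frequency") records produced by the combine step. Each posting is appended with a trailing semicolon, matching the format shown above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Imitates the reduce phase: concatenate all "uri:frequency" postings of a word.
class IndexReduceSim {

    static Map<String, String> reduce(List<Map.Entry<String, String>> records) {
        Map<String, StringBuilder> lists = new TreeMap<>();
        for (Map.Entry<String, String> r : records) {
            // Append the posting for this word, terminated by a semicolon
            lists.computeIfAbsent(r.getKey(), k -> new StringBuilder())
                 .append(r.getValue()).append(';');
        }
        Map<String, String> out = new TreeMap<>();
        lists.forEach((word, postings) -> out.put(word, postings.toString()));
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> index = reduce(List.of(
                Map.entry("I", "hdfs://192.168.52.140:9000/index/2.txt:3"),
                Map.entry("I", "hdfs://192.168.52.140:9000/index/1.txt:3"),
                Map.entry("NBA", "hdfs://192.168.52.140:9000/index/2.txt:1")));
        index.forEach((word, postings) -> System.out.println(word + " " + postings));
    }
}
```

This sketch keeps the postings in the order they were passed in. In a real job the order in which values reach a reducer is not fixed, so the two files may appear in either order within a document list.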


The final output of the whole job is shown below.




![img](https://img-blog.csdnimg.cn/img_convert/72c71340cea72ce83f90fd213409cc3d.png)
![img](https://img-blog.csdnimg.cn/img_convert/58c70f486d20cca931a6359351561a68.png)
![img](https://img-blog.csdnimg.cn/img_convert/8cdbd9ccb455c9fdc0ee9eb51c6fb2ca.png)
