MapReduce Inverted Index

张嘉睿大聪明

Category: Distributed Computing Systems. Tags: mapreduce, big data, distributed

Article link: https://blog.csdn.net/weixin_45975575/article/details/125447121

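Problem Description

An inverted index is a core index structure in Elasticsearch: it maps the words that occur in documents to the IDs of the documents that contain them. Inverted indexes arise from the practical need to look up records by the value of an attribute. Each entry in such an index holds an attribute value together with the addresses of all records that have that value. Because the attribute value determines the location of the records, rather than the record determining the attribute value, the structure is called an inverted index. In practice, inverted indexes are used mainly in search engines, where they map words to documents so that pages relevant to a user's query can be found quickly.
This problem asks you to implement the construction of an inverted index. Specifically, given a set of English documents, split each document into words on spaces (the documents contain no punctuation), convert every word to lowercase, and exclude stop words. Then build the inverted index over the remaining words: the output key is the word, and the value is a string made up of file names and the number of times the word occurs in each file, with entries for different files separated by ";" (see the sample for details).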
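
To make the task concrete, here is a minimal in-memory sketch of the same structure, without Hadoop. It is only an illustration under stated assumptions: the class name InvertedIndexSketch, the build method, and the file names a.txt and b.txt are hypothetical and are not part of the assignment framework.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of the assignment framework.
final class InvertedIndexSketch {

  // Builds word -> (file name -> occurrence count) from file name -> text.
  static Map<String, Map<String, Integer>> build(Map<String, String> docs, Set<String> stopwords) {
    Map<String, Map<String, Integer>> index = new TreeMap<>();
    for (Map.Entry<String, String> doc : docs.entrySet()) {
      // Split on whitespace, lowercase each token, and drop stop words.
      for (String token : doc.getValue().trim().split("\\s+")) {
        String word = token.toLowerCase();
        if (word.isEmpty() || stopwords.contains(word)) {
          continue;
        }
        index.computeIfAbsent(word, w -> new TreeMap<>())
            .merge(doc.getKey(), 1, Integer::sum);
      }
    }
    return index;
  }

  public static void main(String[] args) {
    Map<String, String> docs = new LinkedHashMap<>();
    docs.put("a.txt", "Ukraine invasion Can China do more to stop the war in Ukraine");
    docs.put("b.txt", "Ukraine maps Ukraine says ceasefire offer immoral");
    Set<String> stopwords = new HashSet<>(Arrays.asList("can", "and", "to", "in"));

    Map<String, Map<String, Integer>> index = build(docs, stopwords);
    // Looking up a word returns the files that contain it, with counts.
    System.out.println(index.get("ukraine"));     // {a.txt=2, b.txt=2}
    System.out.println(index.containsKey("can")); // false
  }
}
```

The lookup goes from a word to the files that contain it, which is the "value determines the records" direction described above. The MapReduce version distributes the same counting across map and reduce tasks.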

Sample

Input

// The input consists of the text of several files; the text of two files is listed below
//www.bbc.comnewsworld-asia-china-60615280
Ukraine invasion Can China do more to stop Russia’s war in Ukraine
//www.bbc.comnewsworld-europe-60506682
Ukraine maps Ukraine says Russian ceasefire offer immoral
// stopwords.txt
can
and
to
in

Output

// Output format: word filename1:count1;filename2:count2;
Ukraine www.bbc.comnewsworld-asia-china-60615280:2;www.bbc.comnewsworld-europe-60506682:2;
invasion www.bbc.comnewsworld-asia-china-60615280:1;
China www.bbc.comnewsworld-asia-china-60615280:1;
do www.bbc.comnewsworld-asia-china-60615280:1;
more www.bbc.comnewsworld-asia-china-60615280:1;
stop www.bbc.comnewsworld-asia-china-60615280:1;
Russia’s www.bbc.comnewsworld-asia-china-60615280:1;
war www.bbc.comnewsworld-asia-china-60615280:1;
maps www.bbc.comnewsworld-europe-60506682:1;
says www.bbc.comnewsworld-europe-60506682:1;
Russian www.bbc.comnewsworld-europe-60506682:1;
ceasefire www.bbc.comnewsworld-europe-60506682:1;
offer www.bbc.comnewsworld-europe-60506682:1;
immoral www.bbc.comnewsworld-europe-60506682:1;

Create the package folder DSPPCode.mapreduce.inverted_index.impl. In it, create InvertedIndexMapperImpl, which extends InvertedIndexMapper and implements its abstract method, and InvertedIndexReducerImpl, which extends InvertedIndexReducer and implements its abstract method.

3. Code

InvertedIndexMapperImpl.java

package DSPPCode.mapreduce.inverted_index.impl;

import DSPPCode.mapreduce.inverted_index.question.InvertedIndexMapper;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndexMapperImpl extends InvertedIndexMapper {
  // Reused output objects; avoids allocating a new Text for every emitted pair.
  private final Text outKey = new Text();
  private final Text outValue = new Text();
  // Stop words, loaded once on the first map() call instead of on every call.
  private Set<String> stopwords;

  @Override
  public void map(Object key, Text value, Mapper<Object, Text, Text, Text>.Context context)
      throws IOException, InterruptedException {
    if (stopwords == null) {
      stopwords = loadStopwords(context);
    }

    // The name of the file this line came from is the value emitted for every word.
    FileSplit split = (FileSplit) context.getInputSplit();
    outValue.set(split.getPath().getName());

    // Split on whitespace, lowercase each word, and skip stop words.
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      String word = tokens.nextToken().toLowerCase();
      if (stopwords.contains(word)) {
        continue;
      }
      outKey.set(word);
      context.write(outKey, outValue);
    }
  }

  // Reads the stop word list (one word per line) from the first cache file.
  private Set<String> loadStopwords(Mapper<Object, Text, Text, Text>.Context context)
      throws IOException {
    URI uri = context.getCacheFiles()[0];
    FileSystem fs = FileSystem.get(uri, context.getConfiguration());
    Set<String> words = new HashSet<>();
    // try-with-resources closes the reader and the underlying stream.
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(fs.open(new Path(uri)), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        words.add(line.trim().toLowerCase());
      }
    }
    return words;
  }
}
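
The pairs the mapper emits can be illustrated without a cluster. The sketch below mimics the tokenizing loop of map for a single input line in plain Java. MapStepSketch, emit, and the file name doc1.txt are hypothetical names used only for this illustration; the real mapper writes Text pairs to the Hadoop context instead of returning a list.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;

// Hypothetical stand-in for the map step; no Hadoop classes involved.
final class MapStepSketch {

  // Returns the (word, fileName) pairs a mapper would emit for one line.
  static List<Map.Entry<String, String>> emit(String fileName, String line, Set<String> stopwords) {
    List<Map.Entry<String, String>> pairs = new ArrayList<>();
    StringTokenizer tokens = new StringTokenizer(line);
    while (tokens.hasMoreTokens()) {
      String word = tokens.nextToken().toLowerCase();
      if (!stopwords.contains(word)) {
        pairs.add(new SimpleEntry<>(word, fileName));
      }
    }
    return pairs;
  }

  public static void main(String[] args) {
    Set<String> stopwords = new HashSet<>(Arrays.asList("can", "and", "to", "in"));
    for (Map.Entry<String, String> pair : emit("doc1.txt", "Ukraine maps Ukraine says", stopwords)) {
      System.out.println(pair.getKey() + "\t" + pair.getValue());
    }
  }
}
```

Note that "ukraine" is emitted twice for this line, once per occurrence. The mapper does no counting; repeated pairs are what let the reducer compute per-file counts.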

InvertedIndexReducerImpl.java

package DSPPCode.mapreduce.inverted_index.impl;

import DSPPCode.mapreduce.inverted_index.question.InvertedIndexReducer;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndexReducerImpl extends InvertedIndexReducer {
  @Override
  public void reduce(Text key, Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context)
      throws IOException, InterruptedException {
    // Each value is a file name; count how often each file name appears for this word.
    Map<String, Integer> counts = new HashMap<>();
    for (Text value : values) {
      counts.merge(value.toString(), 1, Integer::sum);
    }

    // Build "file1:count1;file2:count2;". HashMap iteration order is unspecified,
    // so the order of the file entries is not guaranteed.
    StringBuilder postings = new StringBuilder();
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      postings.append(entry.getKey()).append(':').append(entry.getValue()).append(';');
    }

    context.write(key, new Text(postings.toString()));
  }
}
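
The aggregation done in reduce can likewise be tried in isolation. The following is a rough sketch, assuming the shuffle delivers one file name per occurrence of the word. ReduceStepSketch, postings, and the file names are hypothetical, and the sketch swaps in a TreeMap (not the HashMap used in the reducer above) so that the printed order is deterministic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the reduce step; no Hadoop classes involved.
final class ReduceStepSketch {

  // Collapses the file names received for one word into "file:count;" entries.
  static String postings(List<String> fileNames) {
    // TreeMap is an assumption made here for stable, sorted output.
    Map<String, Integer> counts = new TreeMap<>();
    for (String name : fileNames) {
      counts.merge(name, 1, Integer::sum);
    }
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      sb.append(e.getKey()).append(':').append(e.getValue()).append(';');
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Values the shuffle might deliver for the key "ukraine", in arbitrary order.
    List<String> values = Arrays.asList("doc2.txt", "doc1.txt", "doc2.txt", "doc1.txt");
    System.out.println("ukraine\t" + postings(values)); // ukraine	doc1.txt:2;doc2.txt:2;
  }
}
```

Map.merge with Integer::sum inserts 1 for a file name seen for the first time and adds 1 otherwise, which is the same counting idiom the reducer uses.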
