Inverted Index: MapReduce in Practice


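This article introduces the basic concepts and uses of the inverted index, discusses its role in document retrieval systems, and walks through a concrete example of building one with MapReduce.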

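1. Introduction to the Inverted Index

An inverted index, also called a postings file or inverted file, is an indexing method that stores, for full-text search, a mapping from each word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems.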

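There are two forms of inverted index:

A record-level inverted index (or inverted file index) contains, for each word, a list of the documents that reference it.
A word-level inverted index (or full inverted index) additionally contains the positions of each word within a document.

The latter form supports more functionality (phrase search, for example), but takes more time and space to build.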

Example:

Taking English as an example, here are the texts to be indexed:

T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"

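From these we obtain the following inverted file index:

"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}

A query for the terms "what", "is" and "it" corresponds to the set {0,1} ∩ {0,1,2} ∩ {0,1,2} = {0,1}.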

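For the same texts we obtain the following full inverted index, where each pair consists of a document number and the position of the word within that document. Like the document numbers, word positions start from zero.

"a": {(2, 2)}
"banana": {(2, 3)}
"is": {(0, 1), (0, 4), (1, 1), (2, 1)}
"it": {(0, 0), (0, 3), (1, 2), (2, 0)}
"what": {(0, 2), (1, 0)}

So "banana": {(2, 3)} means that "banana" is in the third document (T2), and that it is the fourth word in that document (position 3).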

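If we run a phrase search for "what is it", every word of the phrase is found in both document 0 and document 1, but the words appear consecutively, as the phrase requires, only in document 1.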
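The example can be reproduced with a short in-memory sketch in plain Java. It is only an illustration of the word-level index and the phrase search described here, not part of the MapReduce job; the class name `PositionalIndexDemo` and its method names are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PositionalIndexDemo {
    /** Builds a word-level index: word -> list of {document number, word position}. */
    static Map<String, List<int[]>> build(String[] docs) {
        Map<String, List<int[]>> index = new TreeMap<>();
        for (int d = 0; d < docs.length; d++) {
            String[] words = docs[d].split("\\s+");
            for (int p = 0; p < words.length; p++) {
                index.computeIfAbsent(words[p], w -> new ArrayList<>()).add(new int[] {d, p});
            }
        }
        return index;
    }

    /** Returns true if the index holds the posting (doc, pos) for the word. */
    static boolean contains(Map<String, List<int[]>> index, String word, int doc, int pos) {
        for (int[] posting : index.getOrDefault(word, new ArrayList<>())) {
            if (posting[0] == doc && posting[1] == pos) {
                return true;
            }
        }
        return false;
    }

    /** Returns the documents in which the words of the phrase occur consecutively. */
    static List<Integer> phraseSearch(Map<String, List<int[]>> index, String phrase) {
        String[] words = phrase.split("\\s+");
        List<Integer> hits = new ArrayList<>();
        // Try every occurrence of the first word as a possible start of the phrase.
        for (int[] start : index.getOrDefault(words[0], new ArrayList<>())) {
            boolean match = true;
            for (int i = 1; i < words.length && match; i++) {
                match = contains(index, words[i], start[0], start[1] + i);
            }
            if (match && !hits.contains(start[0])) {
                hits.add(start[0]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] docs = {"it is what it is", "what is it", "it is a banana"};
        Map<String, List<int[]>> index = build(docs);
        System.out.println(phraseSearch(index, "what is it")); // prints [1]
    }
}
```

The phrase search starts from each occurrence of the first word and checks that every following word appears at the next position in the same document, which is why only document 1 qualifies.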

2. Analysis and Design

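(1) Map phase

The input files are first processed by the default TextInputFormat class, which yields the byte offset and the content of each line of text. The Map phase must then parse each input <key, value> pair to obtain the three pieces of information the inverted index needs: the word, the document URI, and the term frequency, as shown in the figure:

This raises two problems. First, a <key, value> pair can hold only two values, so unless a custom Hadoop data type is used, two of the three values must be merged into one and used as either the key or the value.

Second, a single Reduce phase cannot both count term frequencies and build the document list, so a Combine phase is added to do the counting. Note that Hadoop treats the combiner as an optional optimization and may run it zero or more times, so this design relies on the combiner running exactly once on each map output and on all pairs for a given word reaching the same reduce task (for example, when there is a single reduce task).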

 
```java
public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
    private Text keyInfo = new Text();   // holds the combined word and URI
    private Text valueInfo = new Text(); // holds the term frequency
    private FileSplit split;             // holds the current FileSplit

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Get the FileSplit that this <key, value> pair belongs to
        split = (FileSplit) context.getInputSplit();
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            // The key combines the word and the URI, e.g. "MapReduce:1.txt"
            // (getPath().toString() yields the full path of the file)
            keyInfo.set(itr.nextToken() + ":" + split.getPath().toString());
            // The initial term frequency is 1
            valueInfo.set("1");
            context.write(keyInfo, valueInfo);
        }
    }
}
```
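The mapper merges the word and the document URI into a single key, and that key has to be taken apart again later. A minimal plain-Java sketch of the packing and unpacking follows; the class name `CompositeKeyDemo`, its method names, and the sample URI `hdfs://namenode:9000/input/1.txt` are hypothetical and chosen only for illustration.

```java
public class CompositeKeyDemo {
    /** Packs a word and a document URI into one string, as the mapper does. */
    static String pack(String word, String uri) {
        return word + ":" + uri;
    }

    /** Splits at the FIRST colon only, so colons inside the URI are preserved. */
    static String[] unpack(String packed) {
        int splitIndex = packed.indexOf(':');
        return new String[] {packed.substring(0, splitIndex), packed.substring(splitIndex + 1)};
    }

    public static void main(String[] args) {
        // Hypothetical URI, used only to show that its own colons survive.
        String packed = pack("MapReduce", "hdfs://namenode:9000/input/1.txt");
        String[] parts = unpack(packed);
        System.out.println(parts[0]); // the word
        System.out.println(parts[1]); // the URI, including its own colons
    }
}
```

Splitting at the first colon works because a full path may itself contain colons. It does assume that the word contains none: StringTokenizer splits on whitespace only, so a token such as "note:" would be cut in the wrong place.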

(2) Combine phase

The values that share the same key are summed to obtain the term frequency of a word within a document, as shown in the figure:

```java
public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {
    private Text info = new Text();

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Sum the term frequency
        int sum = 0;
        for (Text value : values) {
            sum += Integer.parseInt(value.toString());
        }
        int splitIndex = key.toString().indexOf(":");
        // The new value combines the URI and the term frequency
        info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
        // The new key is the word alone
        key.set(key.toString().substring(0, splitIndex));
        context.write(key, info);
    }
}
```

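(3) Reduce phase

After the two phases above, the Reduce phase only needs to combine the values that share a key into the format required by the inverted index file; everything else can be left to the MapReduce framework.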

 
```java
public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    private Text result = new Text();

    // The method must be named reduce; a method named "reducer" would never
    // be called, and the framework's default identity reduce would run instead.
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Build the document list
        StringBuilder fileList = new StringBuilder();
        for (Text value : values) {
            fileList.append(value.toString()).append(";");
        }
        result.set(fileList.toString());
        context.write(key, result);
    }
}
```
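The data flow through the three phases can be traced without a cluster. The following plain-Java sketch mimics it on the three sample texts from section 1; it uses no Hadoop classes, and the class name `PipelineSketch`, its method names, and the use of T0, T1 and T2 as document URIs are assumptions made for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

public class PipelineSketch {
    /** Map + Combine: counts each "word:uri" key, like summing the "1" values. */
    static Map<String, Integer> mapAndCombine(Map<String, String> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().split("\\s+")) {
                counts.merge(word + ":" + doc.getKey(), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Re-keys by word, then Reduce: joins every "uri:count" value with ";". */
    static Map<String, String> reduce(Map<String, Integer> counts) {
        Map<String, String> index = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int splitIndex = e.getKey().indexOf(':');
            String word = e.getKey().substring(0, splitIndex);
            String info = e.getKey().substring(splitIndex + 1) + ":" + e.getValue();
            index.merge(word, info + ";", String::concat);
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("T0", "it is what it is");
        docs.put("T1", "what is it");
        docs.put("T2", "it is a banana");
        // Prints one line per word, e.g. "is" followed by T0:2;T1:1;T2:1;
        reduce(mapAndCombine(docs)).forEach((word, list) -> System.out.println(word + "\t" + list));
    }
}
```

In this sketch the document list of each word comes out sorted because a TreeMap is used; in a real job the order in which values arrive at a reducer is not guaranteed.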

The complete code is as follows:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class InvertedIndex {

    public static class InvertedIndexMapper extends Mapper<Object, Text, Text, Text> {
        private Text keyInfo = new Text();   // holds the combined word and URI
        private Text valueInfo = new Text(); // holds the term frequency
        private FileSplit split;             // holds the current FileSplit

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            split = (FileSplit) context.getInputSplit();
            StringTokenizer itr = new StringTokenizer(value.toString());

            while (itr.hasMoreTokens()) {
                // Key: "word:URI"; value: initial term frequency "1"
                keyInfo.set(itr.nextToken() + ":" + split.getPath().toString());
                valueInfo.set("1");
                context.write(keyInfo, valueInfo);
            }
        }
    }

    public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {
        private Text info = new Text();

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Sum the term frequency
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }
            int splitIndex = key.toString().indexOf(":");
            // New value: "URI:count"; new key: the word alone
            info.set(key.toString().substring(splitIndex + 1) + ":" + sum);
            key.set(key.toString().substring(0, splitIndex));
            context.write(key, info);
        }
    }

    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
        private Text result = new Text();

        // Must be named reduce so that it overrides Reducer.reduce
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Build the document list
            StringBuilder fileList = new StringBuilder();
            for (Text value : values) {
                fileList.append(value.toString()).append(";");
            }
            result.set(fileList.toString());
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: InvertedIndex <in> <out>");
            System.exit(2);
        }

        // Job.getInstance replaces the deprecated Job constructor in Hadoop 2 and later;
        // on Hadoop 1.x use new Job(conf, "InvertedIndex") instead.
        Job job = Job.getInstance(conf, "InvertedIndex");
        job.setJarByClass(InvertedIndex.class);

        job.setMapperClass(InvertedIndexMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setCombinerClass(InvertedIndexCombiner.class);
        job.setReducerClass(InvertedIndexReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```