MapReduce: Simple Full-Text Search, Part 2

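The previous full-text search implemented fuzzy matching; this one implements exact matching. For example, if you search for "mapreduce is simple", it returns only the articles that contain that exact sentence, not every article that happens to contain those three words.
This version needs a rewritten inverted index. To find a sentence we have to know where each word sits in the article, so the index has to record every word's positions as well. The output we want has this format:
MapReduce file1.txt:<1,2,3>;file2.txt:<5,3,1>;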
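This step is fairly simple. In the map phase we emit <key,value> pairs of the form
"filename+word" position
"file1.txt:MapReduce" 1
The local combiner then turns these into:
"filename" "word:<position>"
"file1.txt" "MapReduce:<1,2,3>"
Finally, reduce merges all the words that belong to the same file and outputs
"filename" "word1:<position>;word2:<position>...."
"file1.txt" "MapReduce:<1,2,3>;simple:<5,6,7>"
Note that the search mapper further down reads index lines in the word-keyed layout shown earlier (a word followed by its filename:<position> entries).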
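PS: the input is read from the file one line at a time, so each position here is only the word's position within its line, not its position in the whole article. If a sentence spans two lines, this program cannot recognize it. It looks like the input format would have to be rewritten to fix that; I will change it in the next version.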
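To make the per-line positions concrete, here is a minimal plain-Java sketch (no Hadoop involved) of how one line could be turned into "word:<positions>" entries. The class and method names are hypothetical, and it assumes positions are 1-based word offsets within the line and that words are separated by whitespace; the original indexing job may count differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical helper, not part of the original job.
final class PositionalIndexSketch {

    // Collects the 1-based position of every word in one line of text.
    static Map<String, List<Integer>> positionsInLine(String line) {
        Map<String, List<Integer>> positions = new LinkedHashMap<>();
        StringTokenizer tokens = new StringTokenizer(line);
        int position = 0;
        while (tokens.hasMoreTokens()) {
            position++;
            positions.computeIfAbsent(tokens.nextToken(), w -> new ArrayList<>())
                     .add(position);
        }
        return positions;
    }

    // Formats one word's positions as "word:<p1,p2,...>".
    static String format(String word, List<Integer> positions) {
        StringBuilder sb = new StringBuilder(word).append(":<");
        for (int i = 0; i < positions.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(positions.get(i));
        }
        return sb.append('>').toString();
    }

    // Joins all words of a line into "word1:<...>;word2:<...>".
    static String formatLine(String line) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<Integer>> e : positionsInLine(line).entrySet()) {
            if (sb.length() > 0) {
                sb.append(';');
            }
            sb.append(format(e.getKey(), e.getValue()));
        }
        return sb.toString();
    }
}
```

Because the counter restarts on every line, a sentence that continues on the next line gets positions that no longer look adjacent, which is exactly the limitation described in the PS.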
 
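Next comes the search itself, which works from the index.
The rough idea is as follows.
The map phase uses the sentence being searched for, for example "MapReduce is simple", to filter the words in the inverted index, so that only the words that appear in that sentence survive the map. The output is:
"filename" "word<position>"
"file1.txt" "MapReduce<1,2,3>"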
Reduce then receives all the words of the same file together. The words arrive at reduce in no particular order, so to recognize the sentence I added a class:
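class Address implements Comparable<Address> {
    public String word;  // the word itself
    public int index;    // the word's position

    Address(String word, int index) {
        this.word = word;
        this.index = index;
    }

    @Override
    public String toString() {
        return word + " " + index;
    }

    // Order by position. Integer.compare returns 0 for equal positions,
    // as the Comparable contract requires.
    @Override
    public int compareTo(Address a) {
        return Integer.compare(index, a.index);
    }
}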
 
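word holds the word and index holds its position. By splitting the values of one file into Address objects and sorting them by index, we get something like
M 1
M 2
M 3
i 4
s 5
i 6
M 7
(M stands for MapReduce, i for is, and s for simple.)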
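The expand-and-sort step can be tried out on its own. The sketch below is plain Java with hypothetical names; WordAt is only a stand-in for the Address class, and the input strings assume the "word<p1,p2,...>" value format emitted by the map phase.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of the original job.
final class ExpandAndSortSketch {

    // Stand-in for the post's Address class: one word at one position.
    static final class WordAt {
        final String word;
        final int index;

        WordAt(String word, int index) {
            this.word = word;
            this.index = index;
        }

        @Override
        public String toString() {
            return word + " " + index;
        }
    }

    // Splits every "word<p1,p2,...>" value into one entry per position,
    // then sorts all entries by position.
    static List<WordAt> expandAndSort(List<String> values) {
        List<WordAt> list = new ArrayList<>();
        for (String value : values) {
            // "MapReduce<1,2,3>" splits into ["MapReduce", "1", "2", "3"].
            String[] parts = value.split("<|>|,");
            for (int j = 1; j < parts.length; j++) {
                list.add(new WordAt(parts[0], Integer.parseInt(parts[j].trim())));
            }
        }
        list.sort(Comparator.comparingInt((WordAt a) -> a.index));
        return list;
    }
}
```

Feeding it the three values "is<4,6>", "MapReduce<1,2,3,7>" and "simple<5>" reproduces the M/i/s listing above, whatever order the values arrive in.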
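So how do we recognize the sentence here? First, the indexes must be adjacent, and the adjacent words must appear in the order M i s. To check the order of adjacent words, I build a list that holds the input arguments, that is, the sentence I am searching for: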
ArrayList<String> sentence = new ArrayList<String>();
for (i = 2; i < wordnum + 2; i++) {
    String arg = conf.get("args" + i);
    sentence.add(arg);
}
 
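Next we set up two cursors: one points at the previous (word, position) entry and the other at the current one. If the previous word and the current word sit next to each other in sentence, and their two indexes are adjacent as well, then n++. Both cursors then move one step forward and the check repeats. Once n equals the number of words in the sentence, a complete sentence has been matched; n is then reset to 1 and the scan continues until the whole list has been traversed.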
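The two-cursor scan described above can be sketched in plain Java as follows. The names are hypothetical, and the sketch makes the same assumptions as the description: the entries are already sorted by index, every word in the list also occurs in the sentence (the map phase filters out the rest), and no word is repeated inside the sentence, since indexOf only finds the first occurrence.

```java
import java.util.List;

// Hypothetical helper, not part of the original job.
final class PhraseMatchSketch {

    // words[i] occurs at position indexes[i]; entries are sorted by position.
    // Returns how many complete occurrences of the sentence were found.
    static int countMatches(String[] words, int[] indexes, List<String> sentence) {
        int matches = 0;
        int n = 0; // length of the current run of in-order, adjacent words
        for (int i = 0; i < words.length; i++) {
            // The current entry extends the run only if it follows the
            // previous entry both in the sentence and in the text.
            boolean continues = i > 0
                    && sentence.indexOf(words[i]) - sentence.indexOf(words[i - 1]) == 1
                    && indexes[i] - indexes[i - 1] == 1;
            n = continues ? n + 1 : 1;
            if (n == sentence.size()) {
                matches++;
                n = 1; // start counting again after a full match
            }
        }
        return matches;
    }
}
```

On the sorted example (MapReduce at 1, 2, 3 and 7, is at 4 and 6, simple at 5) with the sentence "MapReduce is simple", the run grows at positions 3, 4, 5 and the scan reports one match.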
The full code:
 
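import java.io.IOException;
import java.util.ArrayList;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable ikey, Text ivalue, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // The words of the sentence being searched for are stored as args2, args3, ...
        ArrayList<String> contents = new ArrayList<String>();
        int argsnum = Integer.parseInt(conf.get("argsnum"));
        for (int i = 2; i < argsnum; i++) {
            contents.add(conf.get("args" + i));
        }
        // One index line: word, separator, "file:<positions>;file:<positions>;..."
        // Assumes key and value are separated by a tab, the default for TextOutputFormat.
        String[] fields = ivalue.toString().split("\t");
        if (fields.length < 2) {
            return; // skip malformed lines
        }
        String key = fields[0];
        String value = fields[1];
        for (String content : contents) {
            if (content.equals(key)) {
                StringTokenizer st = new StringTokenizer(value, ";");
                while (st.hasMoreTokens()) {
                    String[] posting = st.nextToken().split(":");
                    String filename = posting[0];
                    String adds = posting[1]; // e.g. "<1,2,3>"
                    // Emit ("file1.txt", "MapReduce<1,2,3>").
                    context.write(new Text(filename), new Text(key + adds));
                }
            }
        }
    }

}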
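The filtering done by the mapper can also be exercised without a cluster. The following plain-Java sketch uses hypothetical names and assumes the index line layout used in this post: a word, a tab, then "filename:<positions>" entries separated by semicolons.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original job.
final class IndexLineFilterSketch {

    // Returns {filename, "word<positions>"} pairs for an index line whose
    // word belongs to the query; returns an empty list otherwise.
    static List<String[]> filter(String indexLine, List<String> queryWords) {
        List<String[]> out = new ArrayList<>();
        String[] parts = indexLine.split("\t", 2);
        if (parts.length < 2 || !queryWords.contains(parts[0])) {
            return out;
        }
        String word = parts[0];
        for (String posting : parts[1].split(";")) {
            int colon = posting.indexOf(':');
            if (colon < 0) {
                continue; // skip empty or malformed entries
            }
            String filename = posting.substring(0, colon);
            String positions = posting.substring(colon + 1); // e.g. "<1,2,3>"
            out.add(new String[] { filename, word + positions });
        }
        return out;
    }
}
```

For the line with word MapReduce and entries file1.txt:<1,2,3>;file2.txt:<5,3,1>; this yields one pair per file, and a line whose word is not part of the query yields nothing.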
 
 
 
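class Address implements Comparable<Address> {
    public String word;  // the word itself
    public int index;    // the word's position

    Address(String word, int index) {
        this.word = word;
        this.index = index;
    }

    @Override
    public String toString() {
        return word + " " + index;
    }

    // Order by position. Integer.compare returns 0 for equal positions,
    // as the Comparable contract requires.
    @Override
    public int compareTo(Address a) {
        return Integer.compare(index, a.index);
    }
}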
 
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    public void reduce(Text _key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int wordnum = Integer.parseInt(conf.get("argsnum")) - 2;
        int i = 0;
        // The sentence being searched for, one word per entry.
        ArrayList<String> sentence = new ArrayList<String>();
        for (i = 2; i < wordnum + 2; i++) {
            sentence.add(conf.get("args" + i));
        }

        // Expand every "word<p1,p2,...>" value into one Address per position.
        ArrayList<Address> list = new ArrayList<Address>();
        for (Text val : values) {
            String[] line = val.toString().split("<|>|,");
            for (int j = 1; j < line.length; j++) {
                list.add(new Address(line[0], Integer.parseInt(line[j])));
            }
        }
        if (list.isEmpty()) {
            return;
        }
        Collections.sort(list);

        // Debug output: the sorted (word, index) entries.
        for (Address x : list) {
            System.out.println(x);
        }

        int sum = 0; // number of complete sentences found in this file
        int n = 1;   // length of the current run of in-order, adjacent words
        Address start = list.get(0); // cursor on the previous entry
        for (i = 0; i < list.size(); i++) {
            Address now = list.get(i); // cursor on the current entry
            if (sentence.indexOf(now.word) - sentence.indexOf(start.word) == 1
                    && now.index - start.index == 1) {
                n++;
            } else {
                n = 1;
            }
            // Move the previous cursor by reference. Assigning to start.word and
            // start.index would overwrite the first element of the list.
            start = now;

            if (n == wordnum) {
                System.out.println("match is " + now);
                sum++;
                n = 1;
            }
        }

        /*
        for (i = 0; i < list.size() - 2; i++) {
            Address t1 = list.get(i);
            Address t2 = list.get(i + 1);
            Address t3 = list.get(i + 2);
            if ((t1.index + 2) == t3.index && (t2.index + 1) == t3.index) {
                if (t1.add != t2.add && t1.add != t3.add && t2.add != t3.add) {
                    sum++;
                }
            }
        }
        System.out.println("sum is " + sum);
        */
        if (sum > 0) {
            context.write(_key, new Text(String.valueOf(sum)));
        }
    }

}
 
 
 

Reposted from: https://www.cnblogs.com/sunrye/p/4543370.html
