Search Engine Project: Building the Index (Part 2) - CSDN Blog

Article link: https://blog.csdn.net/a13931329858/article/details/140711938

This article is best read alongside the code:

SearchEngine · 王宇璇/submit - Gitee (gitee.com): https://gitee.com/yxuan-wang/submit/tree/master/SearchEngine

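Building the index

In the previous installment we covered how to parse the files. This time we look at how to build the index files.

We use two data structures, one for the forward index and one for the inverted index.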

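Forward index

Note that we add a DocId attribute to DocInfo. Each document's id is simply its position in the list, so given an id we can look up the document and get its title, URL and content.

Inverted index

A mapping from word => document ids.

To build this we tokenize each document's content (title and body) and give every token a list of document ids. When the user enters a keyword, we use a HashMap to find that word's list, which holds the ids of all articles containing the word; the forward index then gives us the details of those articles.

There is a problem, though: with only a list of ids we ignore the fact that some results are strongly related to the search term while others are only weakly related. This is where weights come in.

A weight entry must contain the docId, plus one more attribute holding the computed weight. The ranking algorithms of large search engines have been refined over time with huge amounts of data; here we just make up a formula that looks reasonable. I count occurrences in the title and in the body separately: weight = titleCount * 10 + contentCount; use whatever suits you.

Parsing yields each document's title, URL and content. We create a new class, DocInfo, to hold them; remember to generate the getters and setters.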
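To make the two-step lookup concrete, here is a minimal sketch with no weights yet. The class LookupSketch and its methods addDoc and search are hypothetical names made up for illustration, and the words are passed in already tokenized; it is not the project's actual code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: word -> doc ids (inverted), doc id -> title (forward).
class LookupSketch {
    // Forward index: a document's id is its position in this list.
    private final List<String> forwardTitles = new ArrayList<>();
    // Inverted index: word => ids of the documents that contain it.
    private final Map<String, List<Integer>> inverted = new HashMap<>();

    // Registers a document; the caller passes its distinct, already tokenized words.
    void addDoc(String title, String... words) {
        int docId = forwardTitles.size();
        forwardTitles.add(title);
        for (String word : words) {
            inverted.computeIfAbsent(word, k -> new ArrayList<>()).add(docId);
        }
    }

    // Step 1: inverted index gives the ids. Step 2: forward index gives the documents.
    List<String> search(String word) {
        List<String> titles = new ArrayList<>();
        for (int docId : inverted.getOrDefault(word, new ArrayList<>())) {
            titles.add(forwardTitles.get(docId));
        }
        return titles;
    }
}
```

Because ids are assigned in insertion order, the results come back in the order the documents were added, which is exactly why a relevance weight is needed next.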
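As a rough sketch of the data model described so far: the getter and setter names below match the calls used later in this article, while the outer class ModelSketch and the helper score are assumptions added only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;

// Hypothetical container for the model classes; the real project may lay them out differently.
class ModelSketch {
    // One parsed document. docId is its position in the forward index.
    static class DocInfo {
        private int docId;
        private String title;
        private String url;
        private String content;

        public int getDocId() { return docId; }
        public void setDocId(int docId) { this.docId = docId; }
        public String getTitle() { return title; }
        public void setTitle(String title) { this.title = title; }
        public String getUrl() { return url; }
        public void setUrl(String url) { this.url = url; }
        public String getContent() { return content; }
        public void setContent(String content) { this.content = content; }
    }

    // One entry of an inverted list: which document, and how relevant it is for the word.
    static class Weight {
        private int docId;
        private int weight;

        public int getDocId() { return docId; }
        public void setDocId(int docId) { this.docId = docId; }
        public int getWeight() { return weight; }
        public void setWeight(int weight) { this.weight = weight; }
    }

    // The two indexes: id => document, and word => weighted document ids.
    ArrayList<DocInfo> forwardIndex = new ArrayList<>();
    HashMap<String, ArrayList<Weight>> invertedIndex = new HashMap<>();

    // The made-up formula from the text: a title hit counts ten times a body hit.
    static int score(int titleCount, int contentCount) {
        return titleCount * 10 + contentCount;
    }
}
```

For example, a word that appears twice in the title and three times in the body gets 2 * 10 + 3 = 23.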

Building the forward index

Each time a file is parsed, we take its result and add it, through the forward-index method, to an ArrayList<DocInfo> instance.

 private DocInfo buildForward(String title, String url, String content) {
        DocInfo docInfo = new DocInfo();
        //mapping from id => document
        docInfo.setTitle(title);
        docInfo.setUrl(url);
        docInfo.setContent(content);
        synchronized (lock2){
            docInfo.setDocId(forwardIndex.size());
            forwardIndex.add(docInfo);
        }
        return docInfo;
    }

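This implementation is fairly simple: instantiate a DocInfo and set the parsed attributes on it.

Note that several threads operate on the same object, forwardIndex, so we need to lock around it; otherwise we run into thread-safety problems. The docInfo is returned because the inverted index needs it later.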
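The point of the lock is that reading size() and calling add() must happen as one step, otherwise two threads could read the same size and hand out the same id. The sketch below checks that property under concurrency; ForwardLockSketch, Doc and idsMatchPositions are hypothetical names, and the thread pool setup is an assumption rather than the project's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration of why size() + add() must share one lock.
class ForwardLockSketch {
    static class Doc {
        int docId;
        String title;
    }

    private final Object lock = new Object();
    private final List<Doc> forwardIndex = new ArrayList<>();

    Doc add(String title) {
        Doc doc = new Doc();
        doc.title = title;
        // Assigning the id and appending happen atomically, so id == position always holds.
        synchronized (lock) {
            doc.docId = forwardIndex.size();
            forwardIndex.add(doc);
        }
        return doc;
    }

    // Adds `docs` documents from `threads` threads, then checks every id equals its position.
    static boolean idsMatchPositions(int docs, int threads) {
        ForwardLockSketch index = new ForwardLockSketch();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < docs; i++) {
            final String title = "doc-" + i;
            pool.execute(() -> index.add(title));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                return false;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        if (index.forwardIndex.size() != docs) {
            return false;
        }
        for (int i = 0; i < docs; i++) {
            if (index.forwardIndex.get(i).docId != i) {
                return false;
            }
        }
        return true;
    }
}
```

Without the synchronized block the same check can fail intermittently, since ArrayList is not thread-safe.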

Building the inverted index

The inverted index is somewhat harder to build than the forward index.

    private void buildInverted(DocInfo docInfo) {
        class WordCnt{
            //number of times the word appears in the title
            public int titleCount;
            //number of times the word appears in the body
            public int contentCount;
        }
        //this map is used to count word frequencies
        HashMap<String , WordCnt> wordCntHashMap = new HashMap<>();

        //The tokenizer library lowercases all words, so we need not convert them ourselves.
        //Search results are generally case-insensitive, so using lowercase throughout makes counting easier.
        List<Term> terms = ToAnalysis.parse(docInfo.getTitle()).getTerms() ;
        for(Term term : terms){
            //check whether the word is already in the map; if not, create a new entry with titleCount set to 1
            String word = term.getName();
            WordCnt wordCnt = wordCntHashMap.get(word);
            if(wordCnt == null){
                WordCnt wordCnt1 = new WordCnt();
                wordCnt1.titleCount = 1;
                wordCnt1.contentCount = 0;
                wordCntHashMap.put(word ,wordCnt1);
            }
            //if it exists, take the existing value and increment titleCount
            else{
                wordCnt.titleCount++;
            }
        }
        //tokenize the body
        terms = ToAnalysis.parse(docInfo.getContent()).getTerms() ;
        for(Term term : terms){
            String word = term.getName();
            WordCnt wordCnt = wordCntHashMap.get(word);
            if(wordCnt == null){
                WordCnt wordCnt1 = new WordCnt();
                wordCnt1.titleCount = 0;
                wordCnt1.contentCount = 1;
                wordCntHashMap.put(word ,wordCnt1);
            }else{
                wordCnt.contentCount++;
            }
        }
        //walk the map built above and update the inverted index with the weights
        for(Map.Entry<String , WordCnt> entry : wordCntHashMap.entrySet()){
            //look the word up in the inverted index first
            //inverted list: for each doc, check whether a list exists for this key; if so append the weight, otherwise create a new list
            synchronized (lock1){
                ArrayList<Weight> invertedList= invertedIndex.get(entry.getKey());
                if(invertedList == null){
                    //no list yet, so insert a new key-value pair
                    ArrayList<Weight> invertedList1 = new ArrayList<>();
                    //build a Weight object and insert it
                    Weight weight = new Weight();
                    weight.setDocId(docInfo.getDocId());
                    weight.setWeight(entry.getValue().contentCount + (entry.getValue().titleCount*10));
                    invertedList1.add(weight);
                    invertedIndex.put(entry.getKey() , invertedList1);
                }else{
                    //the list exists, so build a Weight object and append it to the inverted list
                    Weight weight = new Weight();
                    weight.setDocId(docInfo.getDocId());
                    weight.setWeight(entry.getValue().contentCount + (entry.getValue().titleCount*10));
                    invertedList.add(weight);
                }
            }
        }
    }

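Counting word frequencies

First we create a local class, WordCnt, to hold how many times each word occurs in the current article (counted separately for title and body), and tally the counts in a HashMap: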

HashMap<String , WordCnt> wordCntHashMap = new HashMap<>();

First, tokenize the title of docInfo:

List<Term> terms = ToAnalysis.parse(docInfo.getTitle()).getTerms() ;

The result is stored in a list; we iterate over it and get each term's text as a String with .getName().

Next,

WordCnt wordCnt = wordCntHashMap.get(word);

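This step looks up the word's WordCnt in the hash map. If there is none, the word has not appeared yet and wordCnt == null is true, so we instantiate a new WordCnt to hold its counts. Since we are in the title, it is initialized with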

wordCnt1.titleCount = 1;
wordCnt1.contentCount = 0;

The new instance is then put into wordCntHashMap.

If the word has appeared before, the lookup returns the existing instance and we only need to increment its title count:

wordCnt.titleCount++;

The body is handled the same way, except that contentCount is updated instead.

Building the index
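The counting logic can be tried out on its own with a small sketch. Note the assumptions: the tokenize method below is a simple stand-in that lowercases and splits on non-letter, non-digit characters, not the Ansj tokenizer used in the project, and WordCountSketch is a hypothetical class name. It also uses Map.computeIfAbsent in place of the explicit null check, which is one possible shorter form.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of per-document word counting (title and body kept apart).
class WordCountSketch {
    static class WordCnt {
        int titleCount;   // occurrences in the title
        int contentCount; // occurrences in the body
    }

    static Map<String, WordCnt> count(String title, String content) {
        Map<String, WordCnt> counts = new HashMap<>();
        for (String word : tokenize(title)) {
            // Creates a zeroed WordCnt on first sight, then increments the title counter.
            counts.computeIfAbsent(word, k -> new WordCnt()).titleCount++;
        }
        for (String word : tokenize(content)) {
            counts.computeIfAbsent(word, k -> new WordCnt()).contentCount++;
        }
        return counts;
    }

    // Stand-in tokenizer (NOT Ansj): lowercase, split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String piece : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                words.add(piece);
            }
        }
        return words;
    }
}
```

With the title "Java List" and the body "A list is a list of java objects", the word "list" ends up with titleCount 1 and contentCount 2.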

for(Map.Entry<String , WordCnt> entry : wordCntHashMap.entrySet()){
            //look the word up in the inverted index first
            //inverted list: for each doc, check whether a list exists for this key; if so append the weight, otherwise create a new list
            synchronized (lock1){
                ArrayList<Weight> invertedList= invertedIndex.get(entry.getKey());
                if(invertedList == null){
                    //no list yet, so insert a new key-value pair
                    ArrayList<Weight> invertedList1 = new ArrayList<>();
                    //build a Weight object and insert it
                    Weight weight = new Weight();
                    weight.setDocId(docInfo.getDocId());
                    weight.setWeight(entry.getValue().contentCount + (entry.getValue().titleCount*10));
                    invertedList1.add(weight);
                    invertedIndex.put(entry.getKey() , invertedList1);
                }else{
                    //the list exists, so build a Weight object and append it to the inverted list
                    Weight weight = new Weight();
                    weight.setDocId(docInfo.getDocId());
                    weight.setWeight(entry.getValue().contentCount + (entry.getValue().titleCount*10));
                    invertedList.add(weight);
                }
            }
        }

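This code also operates on shared state, invertedIndex, so thread safety has to be considered and we lock with synchronized. Note that the lock object used here differs from the one used for the forward index. This makes sense: each method has its own thread-safety problem, but the two never modify the same object, so running them concurrently is safe. Putting both behind a single lock would only add lock contention and reduce efficiency.

We iterate over wordCntHashMap. A Map is not directly iterable, so we iterate over its entrySet(), which yields Entry elements.

The rest follows the same idea as the word counting above: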

ArrayList<Weight> invertedList= invertedIndex.get(entry.getKey());

This looks up the list for the word (the key) in invertedIndex.

If it is null, the list has not been created yet, so we create it:

ArrayList<Weight> invertedList1 = new ArrayList<>();

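Then we build the weight: the DocId is that of the file being processed, docInfo.getDocId(), and we store it in weight. Next we compute the weight value,

which is the formula mentioned earlier: weight = titleCount * 10 + contentCount;

Now we have a word (the key) and a weight (containing the id and the weight value). We add weight to the invertedList1 list and put the key-value pair into invertedIndex. When later articles contain this word, their weight is simply appended to that same list.

If the list already exists, that is, when ArrayList<Weight> invertedList= invertedIndex.get(entry.getKey()); is not null, nothing needs to be created: we still compute the weight and id, store them in weight, and add it to invertedList.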
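Since both branches build the same Weight and differ only in whether the list has to be created first, they can be folded together. The following is a minimal sketch of that idea using Map.computeIfAbsent; PostingListSketch, addPosting and postings are hypothetical names, and this is an alternative formulation rather than the code in the repository.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: "create the list if missing, then append" in one call.
class PostingListSketch {
    static class Weight {
        final int docId;
        final int weight;

        Weight(int docId, int weight) {
            this.docId = docId;
            this.weight = weight;
        }
    }

    private final Object lock = new Object();
    private final Map<String, List<Weight>> invertedIndex = new HashMap<>();

    void addPosting(String word, int docId, int titleCount, int contentCount) {
        // Same formula as in the article: a title hit counts ten times a body hit.
        Weight w = new Weight(docId, titleCount * 10 + contentCount);
        synchronized (lock) {
            // Returns the existing list, or creates, stores and returns a new one.
            invertedIndex.computeIfAbsent(word, k -> new ArrayList<>()).add(w);
        }
    }

    List<Weight> postings(String word) {
        synchronized (lock) {
            return invertedIndex.getOrDefault(word, new ArrayList<>());
        }
    }
}
```

The lock is still needed here because a plain HashMap and ArrayList are not safe for concurrent modification.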

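Saving the in-memory index to disk

The code for building the forward and inverted indexes is now complete, but building the index takes quite a long time. So once it is built we store the two data structures, forwardIndex and invertedIndex, in files, and later lookups just load those files.

We can define a path for this: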

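private static final String INDEX_PATH = "path to any folder you create"; // should end with a path separator, because the file names are appended to it directly

The index files will be saved under this path.

Here I use objectMapper.writeValue to save them to files in JSON format.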

public void save(){
        long beg = System.currentTimeMillis();
        System.out.println("Saving index: start");
        //save the two data structures to disk, one file for the forward index and one for the inverted index
        //1. check whether the index directory exists, and create it if not
        File indexFile = new File(INDEX_PATH);
        if(!indexFile.exists()){
            indexFile.mkdirs();
        }
        File forwardIndexFile = new File(INDEX_PATH + "forward.txt");
        File invertedIndexFile = new File(INDEX_PATH + "inverted.txt");
        try{
            objectMapper.writeValue(forwardIndexFile , forwardIndex);
            objectMapper.writeValue(invertedIndexFile , invertedIndex);
        }catch (IOException e){
            e.printStackTrace();
        }
        long end = System.currentTimeMillis();
        System.out.println("Saving index: done, time taken: " + (end - beg) + "ms");
    }

Loading the data back from disk:

 public void load(){
        long beg = System.currentTimeMillis();
        System.out.println("Loading index: start");
        File forwardIndexFile = new File(INDEX_PATH + "forward.txt");
        File invertedIndexFile = new File(INDEX_PATH + "inverted.txt");
        try{
            forwardIndex = objectMapper.readValue(forwardIndexFile,new TypeReference<ArrayList<DocInfo>>(){});
            invertedIndex = objectMapper.readValue(invertedIndexFile , new TypeReference<HashMap<String , ArrayList<Weight>>>(){});
        }catch (IOException e){
            e.printStackTrace();
        }
        long end = System.currentTimeMillis();
        System.out.println("Loading index: done, time taken: " + (end - beg) + "ms");
    }

One thing worth noting here:

forwardIndex = objectMapper.readValue(forwardIndexFile,new TypeReference<ArrayList<DocInfo>>(){});

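When reading JSON we have to tell the mapper what type to produce. Here we want the file read as an ArrayList<DocInfo>, but Java erases generic type arguments at runtime, so there is no class literal such as ArrayList<DocInfo>.class that we could pass. The anonymous subclass new TypeReference<ArrayList<DocInfo>>(){} captures the full generic type in an object that can be passed instead, so the data comes back in the required shape.

If you open my Gitee repository, you can see the Parser code: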
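The trick behind this kind of type holder can be reproduced with the standard library alone. The sketch below defines a hypothetical TypeToken class to show the general mechanism (an anonymous subclass keeps its generic superclass in its class metadata, which reflection can read back); it is an illustration of the idea, not the library's actual source.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.ArrayList;

// Hypothetical illustration of capturing a generic type via an anonymous subclass.
class TypeTokenSketch {
    // Abstract, so callers are forced to write `new TypeToken<...>() {}`.
    static abstract class TypeToken<T> {
        final Type type;

        protected TypeToken() {
            // For an anonymous subclass this is TypeToken<ArrayList<String>>, type argument included.
            ParameterizedType superType = (ParameterizedType) getClass().getGenericSuperclass();
            this.type = superType.getActualTypeArguments()[0];
        }
    }

    // Returns a Type object describing ArrayList<String>, which a class literal cannot express.
    static Type listOfStrings() {
        return new TypeToken<ArrayList<String>>() {}.type;
    }
}
```

The trailing {} matters: it creates the anonymous subclass whose generic superclass carries the type argument. Without it the class is abstract and cannot be instantiated at all.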

private void parseHTML(File file) {
        //one search result consists of: title, description, url
        //1. parse the title
        String title = parseTitle(file);
        //2. parse the url
        String url = parseUrl(file);
        //3. parse the body
        //String content = parseContent(file);
        String content = parseContentByRegex(file);
        //add the parsed content to the Index
        index.addDoc(title , url , content);
    }

After a file is parsed, index.addDoc builds its forward and inverted index entries. After that,

executorService.shutdown();
        index.save();
        long end = System.currentTimeMillis();
        System.out.println("Index built, time taken: "+(end-start)+"ms");
    }

runByThread calls index.save(), which stores the two data structures, forwardIndex and invertedIndex, in files.

Now we run the runByThread method:
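One detail to keep in mind: ExecutorService.shutdown() only stops new tasks from being accepted and does not wait for submitted tasks to finish, so something has to wait for all parsing tasks before the index is saved. The part of runByThread that does this is not shown in the excerpt above, so the sketch below is only one possible arrangement, using a CountDownLatch; WaitThenSaveSketch and run are hypothetical names, and the counter stands in for the real parsing and saving work.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration: wait for every task to finish before saving the index.
class WaitThenSaveSketch {
    // Returns how many "files" had been indexed at the moment saving would start.
    static int run(int fileCount, int threads) {
        AtomicInteger indexed = new AtomicInteger();
        CountDownLatch latch = new CountDownLatch(fileCount);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < fileCount; i++) {
            pool.execute(() -> {
                try {
                    indexed.incrementAndGet(); // stands in for parsing one file and adding it to the index
                } finally {
                    latch.countDown(); // count down even if parsing throws
                }
            });
        }
        try {
            latch.await(); // blocks until all tasks have counted down
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown(); // no more tasks; worker threads can exit
        return indexed.get(); // this is the point where index.save() would be called
    }
}
```

Calling awaitTermination after shutdown() would be another way to get the same guarantee.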

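//wrap the entry point so that, once the application starts, main decides what to run
    public static void main(String[] args) {
        //run the index-building process
        Parser parser = new Parser();
        parser.runByThread();
    }

Two files now appear under the path we configured earlier. Their sizes are as follows: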