Project Diary (3): Boost Search Engine

HuaJiahhh

Last modified: 2024-05-25 11:58:18


First published: 2024-05-25 11:55:10

Article link: https://blog.csdn.net/huajiahhhh/article/details/139176421


1. Preparation

2. Search initialization

3. Search

4. Processing the content

5. Writing the server

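Preface: The previous entry, Project Diary (2), built the index, so this entry can implement the search on top of it. Let's get straight to it. If you find it useful, a like, favorite, and follow would be appreciated. Thanks!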

1. Preparation

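The search needs a structure that holds merged inverted-index results, so define it first. words stores the query keywords that matched a given document.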

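    // Merged inverted-index entry for one document
    struct InvertedElemPrint
    {
        uint64_t doc_id; // document id
        int weight;      // accumulated weight of the document
        vector<string> words;  // keywords that matched this document
        InvertedElemPrint()
            :doc_id(0)
            ,weight(0)
        {}
    };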

2. Search initialization

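The index written earlier is exposed as a singleton. InitSearcher initializes the search: it obtains the index singleton and builds the index from the input file. Both pieces were already implemented in the index entry, so they are simply called here.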

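    class Searcher
    {
    private:
        // Pointer to the index singleton
        ns_index::Index* index = nullptr;
    public:
        Searcher(){}
        ~Searcher(){}
    public:
        // Initialize the search; input is the path of the parsed document file.
        // Obtains the singleton and builds the index.
        void InitSearcher(const string& input)
        {
            // 1. Get (or create) the index object.
            index = ns_index::Index::GetInstance();
            cout << "Got the index singleton..." << endl;

            // 2. Build the forward and inverted indexes through it.
            index->BuildIndex(input);
            cout << "Built the forward and inverted indexes..." << endl;
        }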
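The Index class itself comes from the previous entry and is not reproduced here. As a rough reminder of what GetInstance is doing, here is a minimal sketch of a singleton. DemoIndex, its BuildIndex body, and SourceCount are hypothetical stand-ins, and the real ns_index::Index may well be implemented differently (for example with a mutex-guarded static pointer).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for ns_index::Index; not the class from the previous entry.
class DemoIndex
{
public:
    // A function-local static is initialized exactly once, and that
    // initialization is thread-safe since C++11.
    static DemoIndex* GetInstance()
    {
        static DemoIndex instance;
        return &instance;
    }

    // Placeholder: a real BuildIndex would parse the file and fill both indexes.
    bool BuildIndex(const std::string& input)
    {
        sources_.push_back(input);
        return true;
    }

    std::size_t SourceCount() const { return sources_.size(); }

    // A singleton must not be copied.
    DemoIndex(const DemoIndex&) = delete;
    DemoIndex& operator=(const DemoIndex&) = delete;

private:
    DemoIndex() = default;
    std::vector<std::string> sources_;
};
```

Every call to DemoIndex::GetInstance() returns the same pointer, which is why Searcher can store it once in InitSearcher and reuse it for every query.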

3. Search

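1. Tokenize the query and store the tokens in words. The CutString helper written earlier in util.hpp does this directly.

2. Look up each token in the index. Queries may mix upper and lower case, but matching should be case-insensitive, so each token is lowercased with boost::to_lower from the Boost String Algorithms library. The token is then looked up in the inverted index, and the returned entries fill in the per-document InvertedElemPrint records.

3. Merge and sort. One keyword may map to several documents, and one document may match several keywords, so entries are merged by document id and then sorted by weight in descending order.

4. Build the JSON. The results are serialized into a JSON string with jsoncpp.

5. Extract a snippet of content around the matched keyword. GetDesc handles this.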
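Steps 1 to 3 can be tried in isolation with a toy version that needs only the standard library. This is a sketch under assumptions, not the project code: Posting, Hit, ToyIndex, ToLower, and ToySearch are hypothetical names, whitespace splitting stands in for the jieba-based CutString, and std::tolower stands in for boost::to_lower.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified types for illustration only.
struct Posting { std::uint64_t doc_id; int weight; };
struct Hit
{
    std::uint64_t doc_id = 0;
    int weight = 0;
    std::vector<std::string> words;
};
using ToyIndex = std::unordered_map<std::string, std::vector<Posting>>;

// Lowercase a copy of s (stand-in for boost::to_lower).
std::string ToLower(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
        [](unsigned char c){ return static_cast<char>(std::tolower(c)); });
    return s;
}

std::vector<Hit> ToySearch(const ToyIndex& index, const std::string& query)
{
    std::unordered_map<std::uint64_t, Hit> merged;

    // 1. Tokenize (whitespace split here, instead of jieba).
    std::istringstream in(query);
    std::string word;
    while (in >> word)
    {
        // 2. Case-insensitive lookup in the inverted index.
        word = ToLower(word);
        auto it = index.find(word);
        if (it == index.end()) continue;

        // Merge by document id: weights add up, matched words accumulate.
        for (const Posting& p : it->second)
        {
            Hit& hit = merged[p.doc_id];
            hit.doc_id = p.doc_id;
            hit.weight += p.weight;
            hit.words.push_back(word);
        }
    }

    // 3. Flatten and sort by weight, highest first.
    std::vector<Hit> hits;
    for (auto& kv : merged) hits.push_back(std::move(kv.second));
    std::sort(hits.begin(), hits.end(),
        [](const Hit& a, const Hit& b){ return a.weight > b.weight; });
    return hits;
}
```

Merging through a map keyed by doc_id is what keeps a document that matches two query words from appearing twice in the results; it shows up once, with the two weights added together.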

        // query is the search text; json_string receives the results returned to the browser.
        void Search(const string& query, string* json_string)
        {
            // 1. Tokenize: split the query into words.
            vector<string> words;
            ns_util::JiebaUtil::CutString(query, &words);

            // 2. Look up: query the index for each token, ignoring case.
            vector<InvertedElemPrint> inverted_list_all;
            // Maps a document id to its merged entry.
            unordered_map<uint64_t, InvertedElemPrint> tokens_map;

            // Iterate by value: to_lower modifies the copy, not the original token.
            for(string word : words)
            {
                boost::to_lower(word);

                // Fetch the inverted list for this token.
                ns_index::InvertedList* inverted_list = index->GetInvertedIndex(word);
                // No match for this token: move on to the next one.
                if(nullptr == inverted_list)
                {
                    continue;
                }

                // Merge each entry into the record of its document.
                for(const auto& elem : *inverted_list)
                {
                    auto& item = tokens_map[elem.doc_id];
                    item.doc_id = elem.doc_id;
                    item.weight += elem.weight;
                    item.words.push_back(elem.word); // keyword that matched this document
                }
            }

            // Non-const reference, so that move really moves instead of copying.
            for(auto& item : tokens_map)
            {
                inverted_list_all.push_back(move(item.second));
            }

            // 3. Sort by weight in descending order.
            sort(inverted_list_all.begin(), inverted_list_all.end(),
                [](const InvertedElemPrint& e1, const InvertedElemPrint& e2)
                {return e1.weight > e2.weight;});

            // 4. Build: serialize the results into a JSON string with jsoncpp.
            Json::Value root;
            for(auto& item : inverted_list_all)
            {
                // Fetch the document through the forward index.
                ns_index::DocInfo* doc = index->GetForwardIndex(item.doc_id);
                if(nullptr == doc)
                {
                    continue;
                }

                Json::Value elem;
                elem["title"] = doc->title;
                elem["desc"] = GetDesc(doc->content, item.words[0]);
                elem["url"] = doc->url;

                elem["id"] = (int)item.doc_id;
                elem["weight"] = item.weight;

                root.append(elem);
            }

            Json::FastWriter writer;
            *json_string = writer.write(root);
        }

4. Processing the content

GetDesc extracts the text surrounding the keyword. The lookup uses search, the standard algorithm from the algorithm header (std::search).

        string GetDesc(const string& html_content, const string& word)
        {
            // Take up to 50 bytes before and 100 bytes after the first occurrence of word in html_content.
            const int prev_step = 50;
            const int next_step = 100;

            // 1. Find the first occurrence of the keyword, ignoring case.
            // unsigned char parameters keep tolower well-defined for bytes >= 0x80.
            auto iter = search(html_content.begin(), html_content.end(), word.begin(), word.end(),
                    [](unsigned char x, unsigned char y){return (tolower(x) == tolower(y));});
            if(iter == html_content.end())
            {
                return "None1";
            }
            // distance returns the offset between the two iterators.
            int pos = distance(html_content.begin(), iter);

            // 2. Compute the window [start, end), clamped to the bounds of the content.
            int start = 0;
            int end = html_content.size();

            if(pos > start + prev_step) start = pos - prev_step;
            if(pos + next_step < end) end = pos + next_step;

            // 3. Extract the substring between start and end.
            if(start >= end) return "None2";
            string desc = html_content.substr(start, end - start);
            desc += "...";
            return desc;
        }
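To check the window arithmetic without a 150-byte input, the same idea can be written with the window sizes as parameters. This is a minimal sketch rather than the project's function: MakeSnippet is a hypothetical helper, and it returns an empty string where GetDesc returns "None1".

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical helper: up to prev_step bytes before and next_step bytes
// after the first case-insensitive match of word, followed by "...".
std::string MakeSnippet(const std::string& text, const std::string& word,
                        std::size_t prev_step, std::size_t next_step)
{
    auto iter = std::search(text.begin(), text.end(), word.begin(), word.end(),
        [](unsigned char x, unsigned char y)
        { return std::tolower(x) == std::tolower(y); });
    if (iter == text.end()) return "";

    // Offset of the match from the start of the text.
    std::size_t pos = static_cast<std::size_t>(std::distance(text.begin(), iter));

    // Clamp the half-open window [start, end) to the text.
    std::size_t start = pos > prev_step ? pos - prev_step : 0;
    std::size_t end = std::min(text.size(), pos + next_step);

    return text.substr(start, end - start) + "...";
}
```

With text "The Boost Filesystem library", word "filesystem", prev_step 6 and next_step 10, the match starts at offset 10, so the window is [4, 20) and the result is "Boost Filesystem...". Note that the window is counted in bytes, so on UTF-8 content either edge can land in the middle of a multi-byte character.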

5. Writing the server

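This part uses the cpp-httplib library. You can find it on Gitee and download it onto your Linux machine (for example, from an Xshell session); after that it is ready to use.

First initialize the searcher, then create the server with httplib. The server reads the keyword from the request, calls Search to fill a JSON string, and sends that JSON back to the client.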

#include <iostream>
#include "searcher.hpp"
#include "cpp-httplib/httplib.h"

// Path of the parsed source data.
const string input = "data/raw_html/raw.txt";
// Web root directory for static files.
const string root_path = "./wwwroot";

int main()
{
    ns_searcher::Searcher search;
    search.InitSearcher(input);

    // Create the server with httplib.
    httplib::Server svr;
    svr.set_base_dir(root_path.c_str());

    // Handle /s: read the keyword and reply with the results as JSON.
    // search must be captured by reference to be usable inside the lambda.
    svr.Get("/s", [&search](const httplib::Request& req, httplib::Response& rsp)
    {
        // No search term was supplied.
        if(!req.has_param("word"))
        {
            rsp.set_content("A search keyword is required!", "text/plain; charset=utf-8");
            return;
        }

        // Read the keyword.
        string word = req.get_param_value("word");
        cout << "User is searching for: " << word << endl;
        string json_string;
        // Run the search.
        search.Search(word, &json_string);
        // Send the JSON back to the client.
        rsp.set_content(json_string, "application/json");
    });

    cout << "Server is starting..." << endl;
    svr.listen("0.0.0.0", 8081);
    return 0;
}