Boost Search Engine

栗悟饭&龟波气功

Published 2024-08-18 17:54:10

Tags: search engine

Article link: https://blog.csdn.net/weixin_42749846/article/details/141302918

BoostSearchEngine

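Project goal:

The Boost libraries used to have no search engine, which made browsing the documentation quite tedious. That gave me the idea of making Boost easier to look up, and a search engine is clearly the most convenient way to do it. Later I happened to notice that the directory structure of the documentation downloaded from Boost closely matches the wwwroot layout of the Boost website. That led to a bolder idea: process the downloaded directory, extract its content, support keyword queries over the Boost documentation, and have each result link to the official online page. Before I got around to implementing it, Boost shipped its own search feature. I decided to finish the project anyway; even if it is no longer needed, it makes a good exercise.

Project repository: https://gitee.com/RachelByers/BoostSearcher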

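Project analysis:

Look at the results of a mainstream search engine such as Bing: each result consists of three main parts, a title, a description, and a URL.

Our search results follow the same layout.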

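Tech stack and third-party libraries:

The project uses C++11, network programming over HTTP, Boost.Filesystem, Jsoncpp, cppjieba, and cpp-httplib.

It also uses some front-end technology (HTML, CSS, jQuery), but the front end is not the focus of this project.

Project environment:

Ubuntu with VS Code, g++, and CMake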

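How it works

The project has four stages, each covered below: clean the downloaded HTML files into a raw data file, build a forward index and an inverted index from that file, answer queries through the Searcher module, and expose the searcher over HTTP to a small web front end.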

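Implementation

Data cleaning

As noted above, each result has three parts: title, description, and URL. We now extract each of them from the downloaded HTML files.

Extracting the title

The title is normally the text between <title> and </title>, so we locate both tags with find and take the substring between them.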
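The find-and-substring step can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function name ExtractTitle and its signature are assumptions, and it only handles a plain <title> tag without attributes.

```cpp
#include <string>

// Hypothetical helper: copies the text between <title> and </title> into *title.
// Returns false when either tag is missing.
bool ExtractTitle(const std::string &html, std::string *title)
{
    const std::string open = "<title>";
    const std::string close = "</title>";

    std::size_t begin = html.find(open);
    if (begin == std::string::npos)
        return false;
    begin += open.size(); // skip past the opening tag itself

    // Search for the closing tag only after the opening tag
    std::size_t end = html.find(close, begin);
    if (end == std::string::npos)
        return false;

    *title = html.substr(begin, end - begin);
    return true;
}
```

A file without a title simply makes the function return false, so the caller can decide whether to skip that file or index it with an empty title.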

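Extracting the description

The documentation downloaded from the official site contains a large number of HTML files. HTML source is full of tags, which are useless as search keywords, so we strip them and save the remaining text to a file that the indexing step reads later.

HTML tags usually come in pairs, so a small state machine is enough: while inside a tag we discard characters, and once the tag closes we start keeping them. Some tags stand alone or are immediately followed by another tag, so for every character of content we must first check whether it opens a new tag.

Sample code: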

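enum class State
{
    LABEL,  // inside a tag: discard characters
    CONTENT // between tags: keep characters
};

State s = State::LABEL; // an HTML document starts with a tag
std::string content;    // the extracted description text
for (char c : buffer)   // buffer holds the whole HTML document
{
    switch (s)
    {
    case State::LABEL:
        if (c == '>')
            s = State::CONTENT;
        break;
    case State::CONTENT:
        if (c == '<')
        {
            s = State::LABEL;
            break;
        }
        if (c == '\n')
            c = ' '; // '\n' is reserved as the record separator in the Raw file
        content += c;
        break;
    default:
        break;
    }
}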

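Extracting the URL

As mentioned above, each downloaded file's local path corresponds to a path under the wwwroot of the official Boost site, so we build the URL by replacing the local directory prefix with the official one.

Sample code: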

bool ParseUrl(const std::string &path, DocInfo *docinfo)
{
    // e.g. official: https://www.boost.org/doc/libs/1_85_0/doc/html/function.html
    //      local:    /home/Rachel/111/BoostSearcher/Data/input/html/function.html
    const std::string prev = "https://www.boost.org/doc/libs/1_85_0/doc/html/";
    const std::string local = "/home/Rachel/111/BoostSearcher/Data/input/html/";

    // Reject paths outside the local input directory; substr would otherwise
    // throw or produce a wrong URL.
    if (path.compare(0, local.size(), local) != 0)
        return false;

    docinfo->url = prev + path.substr(local.size());
    return true;
}

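Finally, the fields extracted from each file are separated by \3, the records of different files are separated by \n, and everything is written to the Raw file.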
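The record format can be sketched like this. The helper names MakeRawLine and SplitRawLine are hypothetical; the project does the splitting through StringUtil::StringCut in Util.hpp, whose implementation is not shown in this article.

```cpp
#include <string>
#include <vector>

// Separator between the fields of one record; records end with '\n'.
const char kFieldSep = '\3';

// Joins title, content and url into one record line (illustrative helper).
std::string MakeRawLine(const std::string &title,
                        const std::string &content,
                        const std::string &url)
{
    std::string line;
    line += title;
    line += kFieldSep;
    line += content;
    line += kFieldSep;
    line += url;
    line += '\n';
    return line;
}

// Splits one record (without its trailing '\n') back into its fields.
std::vector<std::string> SplitRawLine(const std::string &line)
{
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true)
    {
        std::size_t pos = line.find(kFieldSep, start);
        if (pos == std::string::npos)
        {
            fields.push_back(line.substr(start)); // last field
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```

This layout only works if no field contains '\n' or '\3', which is why the cleaning step turns newlines in the content into spaces. A well-formed record always splits into exactly three fields.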

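Building the index

Approach

Forward index:

A forward index maps a document ID to the document's content. For example, suppose one document contains the sentence "我喜欢吃雪糕" ("I like eating ice cream") and another contains "雪糕喜欢吃他" ("Ice cream likes eating him").

Document ID	Document content
1	我喜欢吃雪糕
2	雪糕喜欢吃他

Segment each document into words:

[我喜欢吃雪糕]: 我 \ 喜欢 \ 吃 \ 雪糕
[雪糕喜欢吃他]: 雪糕 \ 喜欢 \ 吃 \ 他

Inverted index:

An inverted index maps each keyword to the IDs of the documents that contain it.

Keyword (unique)	Document IDs
我	1
喜欢	1,2
吃	1,2
雪糕	1,2
他	2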
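To make the table concrete, here is a toy sketch that builds such a keyword-to-IDs map from documents that are already segmented. The names TokenizedDoc, ToyInvertedIndex and BuildToyInvertedIndex are made up for illustration, and the real Index module stores weighted elements rather than bare IDs.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using TokenizedDoc = std::vector<std::string>;
using ToyInvertedIndex = std::unordered_map<std::string, std::vector<uint64_t>>;

// Builds keyword -> list of document IDs. The document ID is position + 1,
// matching the numbering used in the tables above.
ToyInvertedIndex BuildToyInvertedIndex(const std::vector<TokenizedDoc> &docs)
{
    ToyInvertedIndex index;
    for (std::size_t i = 0; i < docs.size(); ++i)
    {
        const uint64_t id = i + 1;
        for (const std::string &word : docs[i])
        {
            std::vector<uint64_t> &ids = index[word];
            // Documents are processed in order, so checking the last ID is
            // enough to record each document at most once per keyword.
            if (ids.empty() || ids.back() != id)
                ids.push_back(id);
        }
    }
    return index;
}
```

Feeding it the two segmented documents from the example yields five keywords, with the words shared by both documents mapped to IDs 1 and 2.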

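Writing the Index module

We build the forward and inverted indexes in turn from the extracted content; the comments in the code below explain each step.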

#pragma once
#include <cstdint>
#include <iostream>
#include <vector>
#include <string>
#include <unordered_map>
#include <fstream>
#include <mutex>
#include <boost/algorithm/string.hpp>
#include "Util.hpp"
#include "Log.hpp"

using namespace log_ns;

struct DocInfo
{
    uint64_t id;
    std::string title;
    std::string content;
    std::string url;
};
struct InvertElement
{
    uint64_t id;
    std::string word;
    int weight;
};

class Index
{
public:
    using InvertedList = std::vector<InvertElement>;

private:
    Index() {}
    Index(const Index &) = delete;
    Index &operator=(const Index &) = delete;

public:
    ~Index() {}
    DocInfo *GetForwardIndex(uint64_t id)
    {
        if (id >= _forwardIndex.size())
        {
            lg(INFO, "Forward index lookup failed, id out of range, id:%llu\n", (unsigned long long)id);
            return nullptr;
        }
        return &_forwardIndex[id];
    }
    InvertedList *GetInvertedIndex(const std::string &index)
    {
        auto ret = _invertedList.find(index);
        if (ret == _invertedList.end())
        {
            lg(INFO, "Inverted Indexing failed,index:%s\n", index.c_str());
            return nullptr;
        }
        lg(DEBUG, "ret->second is valid\n");
        lg(DEBUG, "ret->second.size:%zu\n", ret->second.size());
        return &ret->second;
    }
    static Index *GetInstance()
    {
        if (_instance == nullptr)
        {
            // lock_guard releases the mutex even if the constructor throws
            std::lock_guard<std::mutex> lock(_mutex);
            if (_instance == nullptr)
            {
                _instance = new Index();
            }
        }
        return _instance;
    }
    bool BuildIndex(const std::string &path)
    {
        std::ifstream in(path, std::ios::in);
        if (!in.is_open())
        {
            lg(FATAL, "BuildIndex failed to open %s\n", path.c_str());
            return false;
        }
        std::string line;
        const std::string sep = "\3";
        int cnt = 0;
        while (std::getline(in, line))
        {
            std::vector<std::string> result;
            StringUtil::StringCut(result, line, sep);
            if (result.size() != 3)
            {
                lg(WARNING, "Build Index Error StringCut Failed,result.size=%zu \n", result.size());
                for(auto& str:result)
                {
                    std::cout<<str<<std::endl;
                    std::cout<<"==================================================="<<std::endl;
                }
                continue;
            }
            DocInfo *temp = BulidForwardIndex(result);
            // Build the inverted index entries from the forward index entry
            BulidInvetedIndex(temp);
            cnt++;
            if (cnt % 100 == 0)
            {
                lg(INFO, "Index build progress: %d\n", cnt);
            }
        }
        return true;
    }
    void BulidInvetedIndex(DocInfo *doc_info)
    {
        // 1. Segment the title and the content into words
        // 2. Count how often each word occurs in each
        // 3. Compute a weight per word and fill the inverted lists
        struct WordFrequency
        {
            int title_cnt = 0;
            int content_cnt = 0;
        };
        std::unordered_map<std::string, WordFrequency> wordsmap;
        // Segment the title first
        std::vector<std::string> result_title;
        JiebaUtil::SeparateWords(doc_info->title, &result_title);
        for (std::string &word : result_title)
        {
            boost::to_lower(word);
            wordsmap[word].title_cnt++;
        }
        // Segment the content
        const int X = 10; // weight of a hit in the title
        const int Y = 1;  // weight of a hit in the content
        std::vector<std::string> result_content;
        JiebaUtil::SeparateWords(doc_info->content, &result_content);
        for (std::string &word : result_content)
        {
            boost::to_lower(word); // queries are lowercased, so index keys must be too
            wordsmap[word].content_cnt++;
        }
        // Append one element per word to the inverted lists
        for (const auto &it : wordsmap)
        {
            InvertElement element;
            element.id = doc_info->id;
            element.word = it.first;
            element.weight = it.second.title_cnt * X + it.second.content_cnt * Y;
            _invertedList[it.first].push_back(std::move(element));
        }
    }
    }
    DocInfo *BulidForwardIndex(const std::vector<std::string> &result)
    {
        DocInfo info;
        info.title = result[0];
        info.content = result[1];
        info.url = result[2];
        info.id = _forwardIndex.size();
        _forwardIndex.emplace_back(std::move(info));
        return &_forwardIndex.back();
    }

private:
    std::vector<DocInfo> _forwardIndex;                          // forward index
    std::unordered_map<std::string, InvertedList> _invertedList; // inverted index
    static Index *_instance;                                     // singleton instance
    static std::mutex _mutex;
};
Index *Index::_instance = nullptr;
std::mutex Index::_mutex;
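The class above creates its singleton with a double-checked lock on a raw pointer. One common alternative, shown here only as a sketch and not as what the project does, is a function-local static, which C++11 guarantees is initialized exactly once even under concurrent calls. IndexSketch is a hypothetical stand-in reduced to the singleton part.

```cpp
// Hypothetical stand-in for the Index class, reduced to its singleton part.
class IndexSketch
{
public:
    // C++11 initializes a function-local static exactly once, even when
    // several threads call this function concurrently.
    static IndexSketch &GetInstance()
    {
        static IndexSketch instance;
        return instance;
    }

    IndexSketch(const IndexSketch &) = delete;
    IndexSketch &operator=(const IndexSketch &) = delete;

private:
    IndexSketch() = default;
};
```

With this form no mutex or static pointer member is needed, at the cost of returning a reference instead of the pointer that the rest of the code currently expects.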

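Writing the search module Searcher

Searcher is wrapped in a class whose core is the Search function. It takes two parameters: an input parameter with the query and an output parameter that receives the search results. The results are serialized with Jsoncpp, whose usage is not covered here. Because different query words can occur in the same document, the elements fetched from the inverted index are deduplicated: the InvertedFinal struct merges the elements that share a document ID and sums their weights (relevance).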

#pragma once
#include "Index.hpp"
#include "Util.hpp"
#include <algorithm>
#include <unordered_map>
#include <boost/algorithm/string.hpp>
#include <cctype>
#include <cstring>
#include <jsoncpp/json/json.h>
#include "Log.hpp"

using namespace log_ns;
const static std::string Raw = "/home/Rachel/111/BoostSearcher/Data/Raw_html/Raw";

// Merged (deduplicated) form of InvertElement, one per document
struct InvertedFinal
{
    uint64_t id = 0;
    int weight = 0;
    std::vector<std::string> words;
};

class Searcher
{
public:
    Searcher()
    {
        InitSearcher(Raw);
    }
    ~Searcher() {}
    void InitSearcher(const std::string &input)
    {
        _index = Index::GetInstance();
        lg(INFO, "Got the Index singleton\n");
        _index->BuildIndex(input);
        lg(INFO, "Index built\n");
    }
    std::string GetDesc(const std::string &content, const std::string &word)
    {
        // Keep up to 50 bytes before the first match and 100 bytes from it onward
        const std::size_t prev_gap = 50;
        const std::size_t next_gap = 100;
        // Case-insensitive search; unsigned char keeps tolower well defined
        auto it = std::search(content.begin(), content.end(), word.begin(), word.end(),
                              [](unsigned char e1, unsigned char e2) -> bool
                              { return std::tolower(e1) == std::tolower(e2); });
        if (it == content.end())
        {
            return "NONE";
        }
        const std::size_t pos = std::distance(content.begin(), it);
        const std::size_t begin = pos < prev_gap ? 0 : pos - prev_gap;
        // Clamp to the end of the content instead of dropping its last byte
        const std::size_t end = std::min(pos + next_gap, content.size());
        return content.substr(begin, end - begin) + "......";
    }
    void Search(const std::string &query, std::string *json_str)
    {
        // 1. Segment the query into words
        // 2. Look up the InvertedList of each word in the inverted index
        // 3. Merge and deduplicate the lists, then sort by relevance weight
        // 4. Resolve each result through the forward index to get the page details

        // 1. Segment the query
        std::vector<std::string> words;
        JiebaUtil::SeparateWords(query, &words);
        lg(DEBUG, "query segmented, words:\n");
        // for debug
        for (auto &word : words)
        {
            lg(DEBUG, "%s\n", word.c_str());
        }
        // 2.1 Collect the inverted lists
        std::vector<InvertedFinal> invertedList;
        std::unordered_map<uint64_t, InvertedFinal> finalmap;
        for (auto &word : words)
        {
            boost::to_lower(word);
            // lg(DEBUG, "check that lowercasing a space causes no problem\n");
            if (word == " ")
            {
                lg(DEBUG, "clear space\n");
                continue;
            }
            Index::InvertedList *temp = _index->GetInvertedIndex(word);
            if (temp == nullptr)
            {
                continue;
            }

            // Walk the inverted list and merge elements that share a document id
            for (const auto &elem : *temp)
            {
                auto &it = finalmap[elem.id];
                it.id = elem.id;
                it.weight += elem.weight;
                it.words.push_back(elem.word);
            }
            // All elements with the same id are now merged in finalmap
        }
        // Move the merged values out of finalmap into the result list
        for(auto& it:finalmap)
        {
            invertedList.push_back(std::move(it.second));
        }

        if (invertedList.empty())
        {
            // Return an empty JSON array so the client can still parse the response
            *json_str = "[]";
            return;
        }
        lg(DEBUG, "invertedList size:%zu\n", invertedList.size());
        // 2.2 Sort by weight, descending
        std::sort(invertedList.begin(), invertedList.end(), [](const InvertedFinal &e1, const InvertedFinal &e2) -> bool
                  { return e1.weight > e2.weight; });
        Json::Value root;
        Json::StyledWriter writer;
        for (InvertedFinal &it : invertedList)
        {
            DocInfo *doc = _index->GetForwardIndex(it.id);
            if (doc == nullptr)
            {
                continue; // skip ids missing from the forward index
            }
            Json::Value elem;
            elem["title"] = doc->title;
            elem["desc"] = GetDesc(doc->content, it.words[0]);
            elem["url"] = doc->url;
            elem["weight"] = it.weight;
            root.append(elem);
        }
        *json_str = writer.write(root);
    }

private:
    Index *_index;
};
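The merge-and-rank step inside Search can be isolated into a small standalone sketch. Hit, MergedHit and MergeAndRank are hypothetical names that mirror InvertElement and InvertedFinal; the sketch assumes the hits for all query words have already been collected into one vector.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// One element of an inverted list: this document contains this word.
struct Hit
{
    uint64_t id;
    std::string word;
    int weight;
};

// One merged result per document.
struct MergedHit
{
    uint64_t id = 0;
    int weight = 0;
    std::vector<std::string> words;
};

// Merges hits that share a document id, sums their weights, and returns the
// merged results sorted by weight in descending order.
std::vector<MergedHit> MergeAndRank(const std::vector<Hit> &hits)
{
    std::unordered_map<uint64_t, MergedHit> by_id;
    for (const Hit &hit : hits)
    {
        MergedHit &merged = by_id[hit.id];
        merged.id = hit.id;
        merged.weight += hit.weight;
        merged.words.push_back(hit.word);
    }

    std::vector<MergedHit> ranked;
    ranked.reserve(by_id.size());
    for (auto &entry : by_id)
        ranked.push_back(std::move(entry.second));

    std::sort(ranked.begin(), ranked.end(),
              [](const MergedHit &a, const MergedHit &b)
              { return a.weight > b.weight; });
    return ranked;
}
```

A document that matches two query words ends up as a single result whose weight is the sum of both matches, so it ranks above documents that match only one of them with a lower weight.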

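Writing the HTTP_Server module

At this point all that is left is a thin layer on top of the third-party cpp-httplib library, so the implementation is very simple.

We instantiate a Searcher to run the searches. The handler registered with httplib's Get receives the client's GET request and extracts the word parameter, passes it to the searcher, and receives the results through the output parameter json_resp. After setting the web root directory and the listening port, the server is ready.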

#include"Searcher.hpp"
#include"Http_lib.h"
int main()
{
    Searcher searcher; // lives for the whole of main, so no manual delete is needed
    httplib::Server server;
    server.Get("/s",[&](const httplib::Request& req,httplib::Response& resp)
    {
        if(!req.has_param("word"))
        {
            resp.set_content("must have param 'word'","text/plain;charset=utf-8");
            return;
        }
        std::string word=req.get_param_value("word");
        lg(INFO,"user is searching for %s\n",word.c_str());
        std::string json_resp;
        searcher.Search(word,&json_resp);
        resp.set_content(json_resp,"application/json");
    });
    server.set_base_dir("../wwwroot");
    
    server.listen("0.0.0.0",8000);
    return 0;
}

Front-end module

I am not very familiar with front-end code, so this page was pieced together from what I could look up.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <script src="https://code.jquery.com/jquery-2.1.1.min.js"></script>

    <title>Boost Search Engine</title>
    <style>
        /* Remove all default margins and padding (HTML box model) */
        * {
            /* outer margin */
            margin: 0;
            /* inner padding */
            padding: 0;
        }
        /* Make html and body fill 100% of the viewport height */
        html,
        body {
            height: 100%;
            font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif;
            background: linear-gradient(135deg, #74ebd5, #ACB6E5) fixed;
            /* background-attachment: fixed; */
        }
        /* Class selector: .container */
        .container {
            /* width of the div */
            width: 800px;
            /* center horizontally with auto side margins */
            margin: 0px auto;
            /* top margin, to keep some distance from the top of the page */
            margin-top: 15px;
        }
        /* Descendant selector: .search inside .container */
        .container .search {
            /* same width as the parent */
            width: 100%;
            /* height of 52px */
            height: 52px;
        }


        input[type="text"] {
            float: left;
            width: 600px;
            height: 20px;
            flex: 1;
            padding: 15px;
            border: 1px solid #ddd;
            border-radius: 50px 0 0 50px;
            font-size: 1em;
            outline: none;
            box-shadow: inset 0 1px 3px rgba(0, 0, 0, 0.1);
        }
        button {
            float: left;
            padding: 15px 25px;
            border: none;
            border-radius: 0 50px 50px 0;
            background: #007bff;
            color: #fff;
            font-size: 1em;
            cursor: pointer;
            transition: background 0.3s ease, transform 0.2s ease;
        }
        button:hover {
            background: #0056b3;
            transform: translateY(-2px);
        }

        button:active {
            background: #003d7a;
            transform: translateY(0);
        }
        .container .result {
            width: 100%;
        }
        .container .result .item {
            margin-top: 15px;
        }

        .container .result .item a {
            /* block-level element, so it sits on its own line */
            display: block;
            /* remove the underline of the a tag */
            text-decoration: none;
            /* font size of the link text */
            font-size: 20px;
            /* font color */
            color: #4e6ef2;
        }
        .container .result .item a:hover {
            text-decoration: underline;
        }
        .container .result .item p {
            margin-top: 5px;
            font-size: 16px;
            font-family:'Lucida Sans', 'Lucida Sans Regular', 'Lucida Grande', 'Lucida Sans Unicode', Geneva, Verdana, sans-serif;
        }

        .container .result .item i{
            /* block-level element, so it sits on its own line */
            display: block;
            /* cancel the italic style */
            font-style: normal;
            color: green;
        }
    </style>
</head>
<body>
    <div class="container">
        <div class="search">
            <input type="text" placeholder="Enter keywords">
            <button onclick="Search()">Search</button>
        </div>
        <div class="result">
        </div>
    </div>
    <script>
        function Search(){
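            // alert() would pop up a browser dialog
            // alert("hello js!");
            // 1. Read the query; $ is an alias for jQuery
            let query = $(".container .search input").val();
            console.log("query = " + query); // console is the browser's developer console, handy for inspecting JS data

            // 2. Send the HTTP request; $.ajax is jQuery's function for talking to the back end
            $.ajax({
                type: "GET",
                // encode the query so spaces, '&' and non-ASCII text survive in the URL
                url: "/s?word=" + encodeURIComponent(query),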
                success: function(data){
                    console.log(data);
                    BuildHtml(data);
                }
            });
        }

        function BuildHtml(data){
            // Get the result element from the page
            let result_lable = $(".container .result");
            // Clear the previous search results
            result_lable.empty();

            for( let elem of data){
                // console.log(elem.title);
                // console.log(elem.url);
                let a_lable = $("<a>", {
                    text: elem.title,
                    href: elem.url,
                    // open in a new page
                    target: "_blank"
                });
                let p_lable = $("<p>", {
                    text: elem.desc
                });
                let i_lable = $("<i>", {
                    text: elem.url
                });
                let div_lable = $("<div>", {
                    class: "item"
                });
                a_lable.appendTo(div_lable);
                p_lable.appendTo(div_lable);
                i_lable.appendTo(div_lable);
                div_lable.appendTo(result_lable);
            }
        }
    </script>
</body>
</html>

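Results

That essentially completes the project, but there is still room for extension:

For example, we did not index the whole Boost library. Because our server is too underpowered, only the documents in the doc folder were indexed, and much of the library remains unindexed.
Add a hot-word feature: record each search on the back end and, on later searches, suggest popular terms for the user to click.
Ignore stop words. Stop words are words such as 了, 吗, and 的 that are used so frequently that nearly every document contains them, so they have little indexing value; the segmentation step could filter them out.
A website has to make a living, too: we could add paid ranking as a factor in the weight calculation, or put links to our own blog in a sidebar. In short, there is still a lot in this project that can be improved.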
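As one possible starting point for the stop-word idea, the filter could be applied to the segmenter's output before counting word frequencies. This is a sketch under assumptions: RemoveStopWords is a hypothetical helper, and the stop-word set would have to come from a word list chosen by the project.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Returns the tokens that are not in the stop-word set, preserving order.
std::vector<std::string> RemoveStopWords(const std::vector<std::string> &tokens,
                                         const std::unordered_set<std::string> &stop_words)
{
    std::vector<std::string> kept;
    for (const std::string &token : tokens)
    {
        if (stop_words.find(token) == stop_words.end())
            kept.push_back(token);
    }
    return kept;
}
```

The same filter would need to run on the query words as well, otherwise a query made only of stop words would look up keys that the index no longer contains.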

END
