C++ Project: A Search Engine Built on the Boost Online Documentation (Part 1)

Latest recommended article published on 2024-08-08 13:14:20

_ 菜 -∞


Category: C/C++. Tags: search engine, C++, Boost online documentation

Article link: https://blog.csdn.net/duchenlong/article/details/108189922

Contents

Introduction: how a search engine works
Directory layout and library downloads
Module breakdown
Preprocessing module
Testing the preprocessing module

Next: C++ Project: A Search Engine Built on the Boost Online Documentation (Part 2)
github: https://github.com/duchenlong/boost-search-engine

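Introduction: how a search engine works

When we type a keyword such as 杰尼龟 (Squirtle) into Baidu's search box, a large number of related results appear almost immediately, and every occurrence of the keyword in those results is highlighted in red. This suggests that the query is treated as a keyword: on each search, the backend finds all the data associated with that keyword and returns it as the result list.

To support this, each text is first tokenized into keywords, and an inverted index is built over those keywords. The core idea of an inverted index is a mapping (for example, a hash table) from a word to the documents that contain it.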

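Forward index: given a document id, get the document's content
Inverted index: given a word from the content, get the ids of the documents that contain it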
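
The two mappings can be sketched in a few lines. This is only an illustration of the idea, not the project's index module: the type MiniIndex and its member names are hypothetical, and the documents are assumed to be already split into words.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type for illustration; not the project's actual index class.
struct MiniIndex {
    // Forward index: document id -> document content
    std::vector<std::string> forward;
    // Inverted index: word -> ids of the documents that contain the word
    std::unordered_map<std::string, std::vector<std::size_t>> inverted;

    // Adds a document whose content has already been split into words.
    // Returns the id assigned to the document.
    std::size_t Add(const std::string& content,
                    const std::vector<std::string>& words) {
        const std::size_t doc_id = forward.size();
        forward.push_back(content);
        for (const std::string& word : words) {
            std::vector<std::size_t>& ids = inverted[word];
            // Record each document at most once per word.
            if (ids.empty() || ids.back() != doc_id) {
                ids.push_back(doc_id);
            }
        }
        return doc_id;
    }

    // Forward lookup: returns nullptr when the id is out of range.
    const std::string* GetDoc(std::size_t doc_id) const {
        return doc_id < forward.size() ? &forward[doc_id] : nullptr;
    }

    // Inverted lookup: returns nullptr when no document contains the word.
    const std::vector<std::size_t>* Find(const std::string& word) const {
        auto it = inverted.find(word);
        return it == inverted.end() ? nullptr : &it->second;
    }
};
```

With two documents added, looking up a word shared by both returns both ids, while GetDoc turns an id back into the original text.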

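This project is built on the Boost online documentation.

The official Boost documentation has no search feature, so it makes a convenient data set, and the project also lays some groundwork for adding search to a blog system.

The forward and inverted indexes can then be built from the text in the HTML files of the official documentation.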

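Directory layout and library downloads

common: shared utility code
data: input and output hold the input and output data, tmp holds temporary data
jieba_dict: dictionary files for jieba tokenization
parser: preprocessing module
searcher: search module and index module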

The following libraries need to be downloaded:

httplib: https://github.com/yhirose/cpp-httplib
g++ must be version 4.9 or later
Boost documentation: download from https://sourceforge.net/projects/boost/files/boost/1.53.0/
Boost online documentation: http://www.boost.org/doc/libs/1_53_0/doc/html/
jieba tokenization library: https://github.com/yanyiwu/cppjieba
This library also needs the limonp component, https://gitee.com/mirrors_yanyiwu/limonp?_from=gitee_search; place it under cppjieba's include directory
boost: yum install boost-devel.x86_64
jsoncpp: yum install jsoncpp-devel.x86_64

After downloading, run a quick test of the tokenizer to confirm that it works.

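Module breakdown

Preprocessing module

Reads each HTML document and extracts its text: the title, the URL, and the body (the content of text tags such as p, span, h1, and so on).
The parsed results are then gathered in one place, in preparation for building the indexes.

Index module

Builds the forward index and the inverted index from the output of the preprocessing module, and exposes interfaces for the other modules to call.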

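Search module

The search query may be a long piece of text, so it is tokenized first (tokenization).

Each resulting token is then looked up in the inverted index to collect all documents related to those keywords (triggering).

The matching documents are then sorted by some rule, so that the more relevant a document is, the earlier it appears (ranking).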
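
The first three steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's search module: Tokenize splits on whitespace as a stand-in for cppjieba, the per-document weight is a hypothetical relevance score, and the names Inverted, Tokenize and Search are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// word -> (doc id, weight) pairs. The weight is a hypothetical relevance score.
using Inverted =
    std::unordered_map<std::string, std::vector<std::pair<std::size_t, int>>>;

// Stand-in tokenizer that splits on whitespace (the project uses cppjieba).
std::vector<std::string> Tokenize(const std::string& query) {
    std::istringstream in(query);
    std::vector<std::string> words;
    std::string word;
    while (in >> word) {
        words.push_back(word);
    }
    return words;
}

// Returns the ids of all matching documents, highest total weight first.
std::vector<std::size_t> Search(const Inverted& index, const std::string& query) {
    // Triggering: accumulate a score for every document hit by any token.
    std::unordered_map<std::size_t, int> score;
    for (const std::string& word : Tokenize(query)) {
        auto it = index.find(word);
        if (it == index.end()) {
            continue;
        }
        for (const auto& hit : it->second) {
            score[hit.first] += hit.second;
        }
    }

    // Ranking: sort by descending score, then by id to keep the order stable.
    std::vector<std::pair<std::size_t, int>> ranked(score.begin(), score.end());
    std::sort(ranked.begin(), ranked.end(),
              [](const std::pair<std::size_t, int>& a,
                 const std::pair<std::size_t, int>& b) {
                  if (a.second != b.second) {
                      return a.second > b.second;
                  }
                  return a.first < b.first;
              });

    std::vector<std::size_t> ids;
    for (const auto& entry : ranked) {
        ids.push_back(entry.first);
    }
    return ids;
}
```

Summing weights across tokens means a document that matches several query words outranks one that matches a single word with the same per-word weight.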

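Finally, the page should not display a document's entire content. Instead, each result entry is built from the title, the URL, and a short description: the document id is looked up in the forward index, and the result is packaged and sent to the client (result construction).
On the web page, only the beginning of the document's text is shown, and the rest is replaced with "...".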
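
A minimal sketch of that truncation, assuming a fixed byte limit; the function name MakeSnippet and the max_len parameter are hypothetical and not taken from the project:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns at most max_len bytes of content, followed by "..." when truncated.
// Note: this cuts by bytes, so it may split a multi-byte UTF-8 character.
// That is acceptable for the mostly-ASCII Boost documentation.
std::string MakeSnippet(const std::string& content, std::size_t max_len) {
    if (content.size() <= max_len) {
        return content;
    }
    return content.substr(0, max_len) + "...";
}
```

Short documents are returned unchanged, so the ellipsis only appears when text was actually cut.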

Server module

An HTTP server that exposes the search service externally, built with httplib.

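Preprocessing module

First obtain the paths of the .html files in the Boost documentation, then parse each document by path (title, URL, body), and finally write the results to a line-oriented text file.

Enumerating all .html file paths in the Boost documentation

This is wrapped in a function that appends every path to a vector and returns false on failure.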

bool GetFilePath(const string& input_path,vector<string>* file_list);

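The path class in boost::filesystem

path-related function	Description
path(); path(const char* pathname); path(const std::string& pathname);	Constructors
const std::string& string( )	Returns the path as a std::string
bool exists(const path&)	Checks whether the path exists. The file can be of any type: regular file, directory, symbolic link, and so on
path extension() const	Member function that returns the extension of the file name, including the leading period (.). For example, for a file named test.cpp, extension() returns .cpp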

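boost::filesystem::recursive_directory_iterator recursively traverses everything under a given path. During the traversal we skip directories and any path that is not a .html file.

The paths collected this way are relative paths, relative to doc_searcher.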

// Needs <boost/filesystem.hpp>, <iostream>, <string> and <vector>.
// Collects the paths of all .html files under input_path into file_list.
// Returns false if input_path does not exist.
bool GetFilePath(const string& input_path,vector<string>* file_list){
    namespace fs = boost::filesystem;
    fs::path root_path(input_path);
    if(fs::exists(root_path) == false){
        cout<<input_path<<" does not exist"<<endl;
        return false;
    }

    fs::recursive_directory_iterator end_iter;
    for(fs::recursive_directory_iterator iter(root_path);
        iter != end_iter; ++iter){
        // Skip anything that is not a regular file (for example, a directory)
        if(fs::is_regular_file(*iter) == false){
            continue;
        }

        // Skip files whose extension is not .html
        if(iter->path().extension() != ".html"){
            continue;
        }

        // Append the path to the vector
        file_list->push_back(iter->path().string());
    }

    return true;
}
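
For readers without Boost installed, the same traversal can be sketched with C++17 std::filesystem. This is a swapped-in alternative to the Boost version above, not the article's code; the names IsHtmlPath and ListHtmlFiles are hypothetical, and the throwing overloads are used, so a filesystem error (for example, passing a regular file as the root) surfaces as std::filesystem::filesystem_error.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <vector>

// True when the path's extension is exactly ".html".
bool IsHtmlPath(const std::filesystem::path& p) {
    return p.extension() == ".html";
}

// Collects every regular .html file under input_path (recursively).
// Returns false when input_path does not exist.
bool ListHtmlFiles(const std::string& input_path,
                   std::vector<std::string>* file_list) {
    namespace fs = std::filesystem;
    assert(file_list != nullptr);

    const fs::path root(input_path);
    if (!fs::exists(root)) {
        return false;
    }

    for (const fs::directory_entry& entry :
         fs::recursive_directory_iterator(root)) {
        // Skip directories and other non-regular entries.
        if (!entry.is_regular_file()) {
            continue;
        }
        if (!IsHtmlPath(entry.path())) {
            continue;
        }
        file_list->push_back(entry.path().string());
    }
    return true;
}
```

Checking is_regular_file before the extension matters: a directory that happens to be named something.html is skipped rather than treated as a document.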

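Parsing each .html file by path

Iterate over the paths in the vector, open each file in turn, and parse it. The parsing is wrapped in a function:

bool ParseFile(const string& file_path,DocInfo* doc_info);

This function reads the file through the Read interface from the common module, and delegates the title, URL, and body to separate parsing functions.

The parsed data is described by a struct:

// Summary of one document
struct DocInfo{
    string _title;      // document title
    string _url;        // document URL
    string _content;    // document body text
};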

// Extract the title, i.e. the text between <title> and </title>
bool ParseTitle(const string& html,string* title){
    size_t begin = html.find("<title>");
    if(begin == string::npos){
        cout<<"title not found"<<endl;
        return false;
    }

    size_t end = html.find("</title>",begin);
    if(end == string::npos){
        cout<<"title not found"<<endl;
        return false;
    }

    // Move begin past the opening tag; an empty title is treated as an error
    begin += string("<title>").size();
    if(begin >= end){
        cout<<"title pos info error"<<endl;
        return false;
    }

    *title = html.substr(begin,end - begin);
    return true;
}

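// A local path looks like:
// ../data/input/html/thread.html
// The corresponding online URL looks like:
// https://www.boost.org/doc/libs/1_53_0/doc/html/thread.html
// g_input_path and g_url_head are global strings defined elsewhere in the project.
bool ParseUrl(const string& file_path,string* url){
    // Strip the local input prefix and prepend the online URL prefix
    string url_tail = file_path.substr(g_input_path.size());
    *url = g_url_head + url_tail;

    return true;
}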
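
As a worked example of this mapping, here is a variant that takes both prefixes as parameters and checks that the path really starts with the input prefix. The function name MapPathToUrl is hypothetical, and the prefix values used in the usage note are assumptions inferred from the comment above, since the article does not show the actual values of g_input_path and g_url_head.

```cpp
#include <cassert>
#include <string>

// Hypothetical variant of ParseUrl with explicit prefixes instead of globals.
// Returns false when file_path does not start with input_prefix.
bool MapPathToUrl(const std::string& file_path,
                  const std::string& input_prefix,
                  const std::string& url_head,
                  std::string* url) {
    assert(url != nullptr);
    // compare(pos, len, str) returns 0 when the first len characters match str.
    if (file_path.size() < input_prefix.size() ||
        file_path.compare(0, input_prefix.size(), input_prefix) != 0) {
        return false;
    }
    *url = url_head + file_path.substr(input_prefix.size());
    return true;
}
```

Assuming the prefixes "../data/input/" and "https://www.boost.org/doc/libs/1_53_0/doc/", the local path ../data/input/html/thread.html maps to https://www.boost.org/doc/libs/1_53_0/doc/html/thread.html. The prefix check avoids the std::out_of_range exception that substr throws when the path is shorter than the prefix.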

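// Extract the body text by dropping everything between '<' and '>'
bool ParseContent(const string& html,string* content){
    bool is_content = true;
    for(auto c : html){
        if(is_content == true){
            if(c == '<'){
                // Entering a tag: ignore everything up to the closing '>'
                is_content = false;
            }
            else{
                // Replace newlines with spaces so each document fits on one line
                if(c == '\n'){
                    c = ' ';
                }
                content->push_back(c);
            }
        }
        else{
            if(c == '>'){
                is_content = true;
            }
            // Characters inside a tag such as <a> are skipped
        }
    }
    return true;
}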


bool ParseFile(const string& file_path,DocInfo* doc_info){
    string html;
    bool ret = common::Util::Read(file_path,&html);
    if(ret == false){
        cout<<file_path<< " file read error"<<endl;
        return false;
    }

    ret = ParseTitle(html,&doc_info->_title);
    if(ret == false){
        cout<<"title analysis error "<<endl;
        return false;
    }

    ret = ParseUrl(file_path,&doc_info->_url);
    if(ret == false){
        cout<<"Url analysis error "<<endl;
        return false;
    }

    ret = ParseContent(html,&doc_info->_content);
    if(ret == false){
        cout<<"content analysis error "<<endl;
        return false;
    }
    return true;
}
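
ParseFile depends on common::Util::Read, which this part of the series does not show. A minimal sketch of what such a helper could look like, assuming it simply loads the whole file into a string; the free function name ReadFile is hypothetical and the project's real implementation may differ:

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Reads the entire file at file_path into *output.
// Returns false when the file cannot be opened.
bool ReadFile(const std::string& file_path, std::string* output) {
    assert(output != nullptr);
    std::ifstream file(file_path.c_str(), std::ios::binary);
    if (!file.is_open()) {
        return false;
    }
    // Copy every byte from the stream buffer into the string.
    output->assign(std::istreambuf_iterator<char>(file),
                   std::istreambuf_iterator<char>());
    return true;
}
```

Opening in binary mode keeps the bytes exactly as they are on disk, which leaves newline handling to ParseContent.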

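Writing the parsed results to an output file

As before, this is wrapped in a function:

void WriteOutput(const DocInfo& doc_info,std::ofstream& out);

The data parsed from each HTML file is written on its own line, with the fields separated by a delimiter. The delimiter must not occur in the body text, so a rarely used character such as '\3' works.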

// Writes one document per line: title \3 url \3 content
void WriteOutput(const DocInfo& doc_info,std::ofstream& out){
    out<<doc_info._title<<"\3"<<doc_info._url
       <<"\3"<<doc_info._content<<endl;
}
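
Any later stage that consumes this file has to split each line back into its three fields. A minimal sketch of such a split, assuming the '\3' delimiter described above; the function name SplitFields is hypothetical, and the project may use a library routine for this instead:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits one output line into fields on the delimiter. Empty fields are kept,
// so a well-formed line always yields exactly three fields.
std::vector<std::string> SplitFields(const std::string& line, char delim = '\3') {
    std::vector<std::string> fields;
    std::size_t begin = 0;
    while (true) {
        const std::size_t end = line.find(delim, begin);
        if (end == std::string::npos) {
            // Last field: everything after the final delimiter.
            fields.push_back(line.substr(begin));
            break;
        }
        fields.push_back(line.substr(begin, end - begin));
        begin = end + 1;
    }
    return fields;
}
```

Keeping empty fields makes malformed lines easy to detect: a reader can reject any line whose field count is not three.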


Testing the preprocessing module

int main(){

    // 1. Collect the paths of the html files
    vector<string> file_list;
    bool ret = GetFilePath(g_input_path,&file_list);
    if(ret == false){
        cout<<"get html file path error"<<endl;
        return 1;
    }

    // Debug output: print every path and the total count
    // for(auto& str : file_list){
    //     cout<<str<<endl;
    // }
    // cout<<file_list.size()<<endl;

    // 2. Walk the collected paths and process each file separately
    std::ofstream output_file(g_output_path.c_str());
    if(output_file.is_open() == false){
        cout<<g_output_path<<" file open error"<<endl;
        return 1;
    }
    for(const auto& file_path : file_list){
        DocInfo doc_info;
        ret = ParseFile(file_path,&doc_info);
        if(ret == false){
            // A file that fails to parse is reported and skipped
            cout<<file_path<<" file analysis error"<<endl;
            continue;
        }
        //cout<<doc_info._title<<' '<<doc_info._url<<endl;
        // 3. Write the parsed document to the output file
        WriteOutput(doc_info,output_file);
    }

    output_file.close();
    return 0;
}