【Project】A Boost-Based Site Search Engine (a step-by-step walkthrough that beginners can follow)

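Almost everyone who browses the web uses a search engine, and many companies already run one, for example Baidu, Sogou and 360 Search. These engines are very large projects, far beyond what a beginner can build. Baidu's engine, for instance, searches the whole web: it crawls key information from every site, stores it and builds indexes over it, all of which has a high technical barrier. This article therefore does not attempt web-wide search.

Web-wide search is out of reach for a beginner, but site search is not.

A typical example of site search is the C++ reference site cplusplus.com, which supports searching for keywords within the site and displays the results.

Typing string into the site search box on cplusplus.com and pressing Enter returns the pages on the site that relate to string.

Compared with web-wide search, site search works on vertical data: the scope and content being searched are closely related, and the data volume is much smaller.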

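2) What is the Boost library

Boost is often described as a quasi-standard library for C++. It provides many features that the C++ standard of the day lacks and acts as a reserve for the language: several Boost components, such as the hash containers and smart pointers, were adopted into C++11.

Boost C++ Libraries is the official Boost website. Unlike cplusplus.com, it has no site search box, so a reader cannot search for a keyword and jump to the relevant page.

Building a site search engine for the Boost documentation is therefore worthwhile.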

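3) What a search result looks like

Take the search engines commonly used in a browser as an example: searching for “大学生” (college students) on Baidu, Sogou or 360 Search returns a list of results.

The results these engines return for a keyword are mostly displayed in the form “page title + content summary + target URL”.

This project therefore uses “page title + content summary + target URL” as its search result format. The images, videos and advertisements that those engines also show are out of scope.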

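II. How the Project Works

1) High-level principle and overall flow

Take web-wide search as a reference: a user types “大学生” into the search box of a browser on the client, and the matching server returns a search result:

The client can get information about “大学生”, that is, the “title + summary + URL” of each page, only if the server already holds the data. A crawler collects that data from across the web and stores it on the server's disk. (ps: crawlers are subject to legal restrictions, so this project uses none; the server-side data is unpacked directly from a Boost release archive, which is legitimate.)
For the client to reach the server, the server must already be running. Once it starts, its first task is to strip tags from the data on disk and clean it. The data taken from Boost is the html pages of the documentation, but the client only needs the “title + content summary + target URL” of each page. Tag removal and cleaning produce exactly that, so the user can click a result and follow its URL to the corresponding Boost documentation page.
After tag removal and cleaning, the server builds an index over the cleaned data so that lookups are fast.
Once the server has finished this work, the client can send an http request that carries the search keyword via the GET method. The server parses the request and looks the keyword up in the index it built. If matching html documents are found, it assembles the “title + summary + URL” of each page into a new page and sends it back as the response.
When the client receives the response, the user sees the search results and can click any of them to jump to the corresponding Boost documentation page.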

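The implementation flow of this project's site search engine is therefore roughly:

Get the Boost release data;
Strip the tags and clean the data;
Build an index over the cleaned data;
Assemble the “title + summary + URL” of each page and return it to the user.

【Tips】How the site search engine works in detail

【Tips】The detailed implementation flow of the Boost-based site search engine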

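2) Forward index and inverted index

Suppose there are two documents, one containing “雷军买了四斤小米” (Lei Jun bought four jin of millet) and the other “雷军发布了小米手机” (Lei Jun released the Xiaomi phone); note that 小米 means both millet and the Xiaomi brand. Numbering the documents gives a mapping from document ID to content: document 1 is “雷军买了四斤小米” and document 2 is “雷军发布了小米手机”.

To find out how much millet Lei Jun bought, we have to locate the right document and read its content. The document containing “雷军买了四斤小米” has ID 1, so we look up document 1 and then read it. This is a forward index.

In short, a forward index maps a document ID to the document's content or its keywords.

A user searching for a keyword does not know any document ID, however, so a forward index alone does not match how users search. That is what the inverted index is for.

Building the inverted index starts from the forward index: the content of each document is split into words (tokenized), and stop words are dropped along the way.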

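【Tips】Stop words

Stop words are words that occur so often that they carry little meaning for search, such as a, the and or in English. They are typically articles, prepositions, adverbs or conjunctions.

If a search engine indexed these words, nearly every page would be listed under them, which makes the index large and the work enormous.

To reduce the cost of building the index and to speed up searching, stop words such as “了”, “的” and “吗” in Chinese and “a” and “the” in English are usually ignored during tokenization.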
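The filtering step described above can be sketched as follows. This is only an illustration under stated assumptions: RemoveStopWords is a hypothetical helper that does not exist in the project, and the tokens are assumed to have been produced already by a tokenizer.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not part of the project): drops every token that
// appears in the stop-word set and keeps the others in their original order.
std::vector<std::string> RemoveStopWords(
    const std::vector<std::string> &tokens,
    const std::unordered_set<std::string> &stop_words)
{
    std::vector<std::string> kept;
    for (const std::string &token : tokens)
    {
        if (stop_words.count(token) == 0) // not a stop word, keep it
            kept.push_back(token);
    }
    return kept;
}
```

Passing the tokens 雷军 / 买 / 了 / 四斤 / 小米 together with a stop-word set that contains “了” leaves 雷军 / 买 / 四斤 / 小米 to be indexed.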

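Tokenizing the two documents gives:

Document 1 [雷军买了四斤小米]: 雷军 / 买 / 四斤 / 小米 / 四斤小米
Document 2 [雷军发布了小米手机]: 雷军 / 发布 / 小米 / 小米手机

The inverted index is then built from these tokens: the tokens are deduplicated, and each one is mapped to the IDs of the documents that contain it:

雷军: documents 1, 2
买: document 1
四斤: document 1
小米: documents 1, 2
四斤小米: document 1
发布: document 2
小米手机: document 2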

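With this in place, a search for “小米” proceeds as follows:

The user types “小米” into the search box;
The keyword “小米” is looked up in the inverted index, which yields document IDs 【1, 2】;
The forward index is used to fetch the content of documents 【1, 2】;
The response is assembled in the form “title + summary + URL” and the matching pages are returned to the user.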
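The lookup path above (inverted index first, then forward index) can be sketched with standard containers. This is a minimal illustration, not the project's Index class: MiniIndex, BuildExampleIndex and Search are hypothetical names, and the token lists are typed in by hand instead of coming from a tokenizer.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical miniature of the two indexes for the two example documents.
struct MiniIndex
{
    // forward index: document ID -> content (slot 0 is unused so IDs start at 1)
    std::vector<std::string> forward;
    // inverted index: word -> IDs of the documents that contain it
    std::unordered_map<std::string, std::vector<std::size_t>> inverted;
};

MiniIndex BuildExampleIndex()
{
    MiniIndex idx;
    idx.forward = {"", "雷军买了四斤小米", "雷军发布了小米手机"};
    idx.inverted["雷军"] = {1, 2};
    idx.inverted["买"] = {1};
    idx.inverted["四斤"] = {1};
    idx.inverted["小米"] = {1, 2};
    idx.inverted["四斤小米"] = {1};
    idx.inverted["发布"] = {2};
    idx.inverted["小米手机"] = {2};
    return idx;
}

// Step 1: keyword -> document IDs via the inverted index.
// Step 2: document IDs -> contents via the forward index.
std::vector<std::string> Search(const MiniIndex &idx, const std::string &word)
{
    std::vector<std::string> hits;
    auto it = idx.inverted.find(word);
    if (it == idx.inverted.end())
        return hits; // unknown keyword, nothing to return
    for (std::size_t id : it->second)
        hits.push_back(idx.forward[id]);
    return hits;
}
```

Searching for “小米” returns both documents, while “发布” returns only document 2. The real engine returns a title, summary and URL per hit instead of the raw content.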

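3) Technology stack and project environment

Technology stack: C / C++ / C++11, STL, Boost, Jsoncpp, cppjieba, cpp-httplib, html, css, js

Project environment: a Linux CentOS 7 (or Ubuntu 24.04) cloud server; vim / gcc / g++ / Makefile; VS 2022 / VS Code

【ps】I started this project on CentOS 7, but CentOS 7 reached end of life on 2024-06-30 and many of its configurations stopped working, so I had to switch to Ubuntu 24.04 partway through. Everything done on CentOS 7 in the early part of this article also works on Ubuntu 24.04. If you want to build the project yourself, set up Ubuntu directly and avoid the pitfalls I ran into.

4) Project source code (gitee)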

https://gitee.com/the-driest-one-in-varoran/boost-internal-search-engine.git

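III. Writing the Parser Module: Tag Removal and Data Cleaning

1) Obtaining and cleaning the data

This project uses no crawler; the server-side data is unpacked directly from a Boost release archive. The steps below show how to get it from the Boost website.

First open the Boost website, Boost C++ Libraries, and click more news on the home page.

This leads to the downloads page. Pick a version and click its Download link. Here we use 1.86.0, the latest version at the time of writing.

On the next page, click boost_1_86_0.tar.gz to download it.

Next, transfer the archive to the Linux CentOS 7 machine. Here Xshell 7 is used to log in to the CentOS 7 cloud server remotely (for setting up a cloud server environment, see 【Linux入门】Linux简史).

Open Xshell, log in to the cloud server, and create a Boost_Searcher directory under the user's current directory to hold everything for this project.

Enter the Boost_Searcher directory, run rz -E, and select the Boost archive in the dialog that opens (you can also drag the archive file into the terminal window).

Then unpack the uploaded archive with tar xzf.

Next, do a simple cleanup of the data.

The unpacked Boost tree contains many files, and we do not need all of them. We only need the html files under boost_1_86_0/doc/html, because nearly all the manual pages for the Boost components are stored there as html.

That directory contains more than html files, though, so the data has to be cleaned to keep only the html files.

Create a data directory under Boost_Searcher and an input directory under data, then use cp to copy everything under boost_1_86_0/doc/html into input, ready for tag removal.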

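2) Tag removal

What is a tag, and what does removing tags mean? Both are front-end concepts.

Go into the input directory and open any .html file with vim; the file is full of markup.

The parts enclosed in < > are html tags. They usually come in pairs, but some appear alone.

An html file contains many tags, but the tags themselves have no value for search. What matters is the text that sits between tags, outside the “< >”. So every tag, paired or single, has to be removed to extract the content worth searching. That is tag removal.

The goal is to strip the tags from every html document and write the results into a single file, raw.txt, with the documents separated as follows:

title\3content\3url \n title\3content\3url \n title\3content\3url \n ...

This lets a later getline(ifstream, line) read one whole document at a time: title\3content\3url.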
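To make the record format concrete, here is a small round-trip sketch: one function builds a title\3content\3url\n record and another splits a line read by getline back into its three fields. JoinRecord, SplitRecord and kFieldSep are hypothetical names chosen for this illustration; the project's own code may organize this differently.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

const char kFieldSep = '\3'; // separates title, content and url inside one record

// Builds one record; the trailing '\n' separates it from the next document.
std::string JoinRecord(const std::string &title,
                       const std::string &content,
                       const std::string &url)
{
    return title + kFieldSep + content + kFieldSep + url + '\n';
}

// Splits one line (already stripped of '\n' by getline) on '\3'.
// Returns true only if exactly three fields were found.
bool SplitRecord(const std::string &line, std::vector<std::string> *fields)
{
    fields->clear();
    std::size_t start = 0;
    while (true)
    {
        std::size_t pos = line.find(kFieldSep, start);
        if (pos == std::string::npos)
        {
            fields->push_back(line.substr(start)); // last field
            break;
        }
        fields->push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields->size() == 3;
}
```

Because every document occupies exactly one line, the reader needs nothing more than std::getline followed by a split on '\3'. This only works if the fields themselves never contain '\n' or '\3'.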

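Tag removal is done by a dedicated program.

Create a raw_html directory under Boost_Searcher/data and a raw.txt file inside it, which will later hold the cleaned documents.

Then create a parser.cc file under Boost_Searcher and write the tag-removal code in it.

3) Basic code skeleton of the Parser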

parser.cc

#include<iostream>
#include<string>
#include<vector>
#include<boost/filesystem.hpp>
//directory that holds all the html pages
const std::string src_path="data/input/";
//file that receives the cleaned content
const std::string output="data/raw_html/raw.txt";

//parsed content of one document
typedef struct DocInfo{
    std::string title;  //title of the document
    std::string content;//content of the document
    std::string url;    //url of the document on the official site
}DocInfo_t;


bool EnumFile(const std::string &src_path,std::vector<std::string> *files_list);
bool ParseHtml(const std::vector<std::string> &file_lists,std::vector<DocInfo_t> *results);
bool SaveHtml(const std::vector<DocInfo_t> &results,const std::string &output);
//ps:
//const & - input parameter
//* - output parameter
//& - input/output parameter

int main()
{
    //1.recursively collect the path of every html file into files_list,
    //  so the files can be read one by one later
    std::vector<std::string> files_list;//file paths
    if(!EnumFile(src_path,&files_list))
    {
        std::cerr<<"enum file name error"<<std::endl;
        return 1;
    }
    //2.read each file listed in files_list and parse it
    std::vector<DocInfo_t> results;
    if(!ParseHtml(files_list,&results))
    {
        std::cerr<<"parse html error"<<std::endl;
        return 2;
    }
    //3.write all the parsed content to output;
    //  \3 separates the fields of a document, \n separates documents
    if(!SaveHtml(results,output))
    {
        std::cerr<<"save html error"<<std::endl;
        return 3;
    }
}

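【Tips】Basic code skeleton of the Parser

4) Parser implementation details

.1- Enumerating and collecting the html documents

C++11 and the STL have no filesystem library (std::filesystem only arrived in C++17), so operations such as enumerating files are usually done with the filesystem module of Boost.

【ps】Before using the Boost development libraries for the first time, install them with sudo yum install -y boost-devel.

【ps】On Ubuntu: sudo apt install -y libboost-all-dev

The filesystem module is documented in detail on the Boost website, Boost C++ Libraries.

Enumerating and collecting the html documents means filling in EnumFile() from the skeleton above. A debug print is also added to EnumFile() so that it prints every valid html document it collects, which lets us test the function.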

parser.cc

#include<iostream>
#include<string>
#include<vector>
#include<boost/filesystem.hpp>
//一个存放所有html网页的目录路径
const std::string src_path="data/input/";
//一个存放有效内容的文件路径
const std::string output="data/raw_html/raw.txt";

//文档内容
typedef struct DocInfo{
    std::string title;  //文档的标题
    std::string content;//文档的内容
    std::string url;    //该文档在官网中的url
}DocInfo_t;


bool EnumFile(const std::string &src_path,std::vector<std::string> *files_list);
bool ParseHtml(const std::vector<std::string> &file_lists,std::vector<DocInfo_t> *results);
bool SaveHtml(const std::vector<DocInfo_t> &results,const std::string &output);
//ps:
//const & - 输入型参数
//* - 输出型参数
//& - 输入输出型参数

int main()
{
    //1.递归式地把每个html文件路径，保存到files_list中，
    //  方便后期对文件逐个进行读取
    std::vector<std::string> files_list;//存放文件名
    if(!EnumFile(src_path,&files_list))
    {
        std::cerr<<"enum file name error"<<std::endl;
        return 1;
    }
    //2.按照files_list读取每个文件的内容，并进行解析
    std::vector<DocInfo_t> results;
    if(!ParseHtml(files_list,&results))
    {
        std::cerr<<"parse html error"<<std::endl;
        return 2;
    }
    //3.把解析完的文件内容，全部写入到output
    //  将\3作为每个文档的分隔符
    if(!SaveHtml(results,output))
    {
        std::cerr<<"save html error"<<std::endl;
        return 3;
    }
}


bool EnumFile(const std::string &src_path,std::vector<std::string> *files_list)
{
    namespace fs=boost::filesystem;//alias that shortens the qualified names
    fs::path root_path(src_path);  //path object; enumeration starts from here
    //return false right away if the path does not exist
    if(!fs::exists(root_path))      
    {
        std::cerr<<src_path<<" not exists"<<std::endl;
        return false;
    }
    //add every existing, valid html document to files_list
    fs::recursive_directory_iterator end; //default-constructed iterator marks the end of the traversal
    for(fs::recursive_directory_iterator iter(root_path);iter!=end;iter++)
    {
        //skip anything that is not a regular file, such as a directory
        if(!fs::is_regular_file(*iter))
        {
            continue;
        }
        //skip regular files that are not html files (images, for example)
        if(iter->path().extension()!=".html")
        {
            continue;
        }
        //at this point the current path is a regular file with the .html extension,
        //so store its full path in files_list for the later text analysis
        std::cout<<"debug: "<<iter->path().string()<<std::endl;//debug print
        files_list->push_back(iter->path().string());
    }   
    return true;
}
bool ParseHtml(const std::vector<std::string> &file_lists,std::vector<DocInfo_t> *results)
{
    //filled in later
    return true;
}
bool SaveHtml(const std::vector<DocInfo_t> &results,const std::string &output)
{
    //filled in later
    return true;
}

After writing EnumFile() in parser.cc, write a Makefile to compile and link the program.

Makefile

CC=g++

parser:parser.cc
	$(CC) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem
.PHONY:clean
clean:
	rm -rf parser

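After the program is compiled and run, the path of every valid html document is printed.

.2- Parsing the collected html documents

Parsing the html documents that EnumFile() stored in files_list means filling in ParseHtml().

Parsing reads the documents listed in files_list one by one. Its purpose is to extract the title, summary and URL from each document and store them in results, which has type vector<DocInfo_t>. The steps are:

Read the file
Parse the document and extract the title
Parse the document and extract the content
Parse the document path and extract the url
Put the parsed result into results

A new file, util.hpp, holds the utility methods, and reading a file is one of them. The other parsing functions go into parser.cc.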

util.hpp

#pragma once
#include<iostream>
#include<string>
#include<fstream>
namespace ns_util
{
    class FileUtil
    {
        public:
            //reads the whole file at file_path and appends it to *out
            static bool ReadFile(const std::string &file_path,std::string *out)
            {
                //1.open the file
                std::ifstream in(file_path,std::ios::in);
                if(!in.is_open())
                {
                    std::cerr<<"open file "<<file_path<<" error"<<std::endl;
                    return false;
                }
                //2.read the file line by line
                //  note: getline drops the line terminator, so the lines are joined without a separator
                std::string line;
                while(std::getline(in,line))
                {
                    *out+=line;
                }
                //3.close the file
                in.close();
                return true;
            }
    };
}

parser.cc

#include<iostream>
#include<string>
#include<vector>
#include<boost/filesystem.hpp>
#include"util.hpp"

//一个存放所有html网页的目录路径
const std::string src_path="data/input";
//一个存放有效内容的文件路径
const std::string output="data/raw_html/raw.txt";

//文档内容
typedef struct DocInfo{
    std::string title;  //文档的标题
    std::string content;//文档的内容
    std::string url;    //该文档在官网中的url
}DocInfo_t;


bool EnumFile(const std::string &src_path,std::vector<std::string> *files_list);
bool ParseHtml(const std::vector<std::string> &files_list,std::vector<DocInfo_t> *results);
bool SaveHtml(const std::vector<DocInfo_t> &results,const std::string &output);
//ps:
//const & - 输入型参数
//* - 输出型参数
//& - 输入输出型参数

int main()
{
    //1.递归式地把每个html文件路径，保存到files_list中，
    //  方便后期对文件逐个进行读取
    std::vector<std::string> files_list;//存放文件名
    if(!EnumFile(src_path,&files_list))
    {
        std::cerr<<"enum file name error"<<std::endl;
        return 1;
    }
    //2.按照files_list读取每个文件的内容，并进行解析
    std::vector<DocInfo_t> results;
    if(!ParseHtml(files_list,&results))
    {
        std::cerr<<"parse html error"<<std::endl;
        return 2;
    }
    //3.把解析完的文件内容，全部写入到output
    //  将\3作为每个文档的分隔符
    if(!SaveHtml(results,output))
    {
        std::cerr<<"save html error"<<std::endl;
        return 3;
    }
}


bool EnumFile(const std::string &src_path,std::vector<std::string> *files_list)
{
    namespace fs=boost::filesystem;//这样可以简化作用域的书写
    fs::path root_path(src_path);  // 定义一个path对象，枚举文件就从这个路径下开始
    //判断路径是否存在，不存在则直接返回false
    if(!fs::exists(root_path))      
    {
        std::cerr<<src_path<<"not exists"<<std::endl;
        return false;
    }
    //将存在的、有效的html文档加入files_list
    fs::recursive_directory_iterator end; //定义一个空的迭代器，用于判断递归结束
    for(fs::recursive_directory_iterator iter(root_path);iter!=end;iter++)
    {
        //判断指定路径是不是普通文件，若指定路径是目录或图片则直接跳过
        if(!fs::is_regular_file(*iter))
        {
            continue;
        }
        //如果是普通文件，但不是html文件，也直接跳过
        if(iter->path().extension()!=".html")
        {
            continue;
        }
        //代码走到这里，当前路径一定是一个合法的html文件
        //于是将所有带路径的html，保存在files_list中，方便后续进行文本分析
        //std::cout<<"debug: "<<iter->path().string()<<std::endl;//打印测试
        files_list->push_back(iter->path().string());
    }   
    return true;
}

static bool ParseTitle(const std::string &file,std::string *title) //static: visible only in this file
{
    //filled in below
    return true;
}
static bool ParseContent(const std::string &file,std::string *content)
{
    //filled in below
    return true;
}
static bool ParseUrl(const std::string &file_path,std::string *url)
{    
    //filled in below
    return true;
}
bool ParseHtml(const std::vector<std::string> &files_list,std::vector<DocInfo_t> *results)
{
    for(const std::string &file:files_list)
    {
        //1.read the file
        std::string result;
        if(!ns_util::FileUtil::ReadFile(file,&result))
            continue; 
        DocInfo_t doc;
        //2.parse the document and extract the title
        if(!ParseTitle(result,&doc.title))
            continue;
        //3.parse the document and extract the content (tag removal)
        if(!ParseContent(result,&doc.content))
            continue;
        //4.parse the document path and extract the url
        if(!ParseUrl(file,&doc.url))
            continue;

        //at this point all three steps succeeded
        //and the parsed result of this document is stored in doc

        //5.put the parsed result into results
        results->push_back(doc);

    }
    return true;
}
bool SaveHtml(const std::vector<DocInfo_t> &results,const std::string &output)
{
    //filled in later
    return true;
}

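【Tips】Locating and extracting the document title

The title of an html document is normally enclosed in a title tag, like this:

<title>document title</title>

So finding the title tags in the document locates the title, and dropping the tags leaves the title text.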

static bool ParseTitle(const std::string &file,std::string *title) //static: visible only in this file
{   
    //to extract "document title" from "<title>document title</title>",
    //first find the positions of <title> and </title> in the document,
    //then locate the title text between them
    //and copy it out (ps: the title occupies a half-open range)

    //1.locate <title> and </title>
    std::size_t begin=file.find("<title>");
    if(begin==std::string::npos) 
        return false;
    std::size_t end=file.find("</title>");
    if(end==std::string::npos) 
        return false;
    //2.locate the title text
    begin+=std::string("<title>").size(); //the title text is now in [begin,end)
    //3.extract the title text
    if(begin > end)
        return false;
    *title = file.substr(begin,end-begin);
    
    return true;
}

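【Tips】Parsing the document and extracting the content

This step is tag removal itself: every paired tag, every single tag and everything inside a tag is dropped, and the remaining text is kept as the useful content.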

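static bool ParseContent(const std::string &file,std::string *content)
{
    //this is tag removal,
    //implemented with a simple two-state state machine

    enum status{ //the two states
        LABLE,   //inside a tag
        CONTENT  //inside useful content
    };
    enum status s=LABLE; //assume the file starts inside a tag
    for(char c :file)
    {
        switch(s)
        {
            case LABLE:   //currently inside a tag
                if(c=='>') s=CONTENT; //the tag ends here
                break;
            case CONTENT: //currently inside useful content
                if(c=='<') s=LABLE;   //a new tag starts here
                else {
                    //'\n' is reserved as the separator between documents in raw.txt,
                    //so a newline in the content is replaced by a space
                    if(c=='\n') c=' ';
                    content->push_back(c);
                }
                break;
            default:
                break;
        }
    }
    return true;
}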
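One side effect of a plain two-state machine is that text from adjacent elements is glued together: <p>Hello</p><p>world</p> becomes Helloworld. The sketch below is a hypothetical variant, not the author's ParseContent, that inserts a single space whenever a tag begins so that the words stay separated. StripTags is a name made up for this illustration.

```cpp
#include <string>

// Hypothetical variant of the tag-removal state machine: same two states,
// but a space is emitted at each tag boundary so neighbouring words do not merge.
std::string StripTags(const std::string &html)
{
    enum State { TAG, CONTENT };
    State s = TAG; // same starting assumption as ParseContent
    std::string text;
    for (char c : html)
    {
        if (s == TAG)
        {
            if (c == '>') s = CONTENT; // the tag ends here
        }
        else if (c == '<')
        {
            s = TAG; // a new tag starts here
            if (!text.empty() && text.back() != ' ')
                text.push_back(' '); // keep words on both sides of the tag apart
        }
        else
        {
            text.push_back(c == '\n' ? ' ' : c); // newlines become spaces
        }
    }
    return text;
}
```

Whether the extra space is wanted depends on how the content is tokenized later, so treat this as an option to consider, not as a required change.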

【Tips】Parsing the document path and extracting the url

The url of a Boost page on the website corresponds directly to the path of the document we downloaded.

For example, the url of Accumulators on the official site is: https://www.boost.org/doc/libs/1_86_0/doc/html/accumulators.html

We copied all the html documents from the downloaded archive into data/input/, so the path of Accumulators in our project is data/input/accumulators.html.

To get the same url as the official site from our project:

Take the head: use the fixed part of the official url, url_head = "https://www.boost.org/doc/libs/1_86_0/doc/html";
Take the tail: remove the data/input prefix from data/input/accumulators.html to get url_tail = "/accumulators.html";
Join the head and the tail into a complete url: url = url_head + url_tail = "https://www.boost.org/doc/libs/1_86_0/doc/html/accumulators.html".

This yields a working page link.

static bool ParseUrl(const std::string &file_path,std::string *url)
{    
    std::string url_head="https://www.boost.org/doc/libs/1_86_0/doc/html";
    std::string url_tail=file_path.substr(src_path.size());
    *url=url_head+url_tail;
    return true;
}
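ParseUrl above relies on every path starting with src_path and on src_path having no trailing slash. As a hedged sketch of how those assumptions could be checked, the hypothetical BuildUrl below takes the source root as a parameter, verifies the prefix before calling substr, and tolerates a root written with or without the trailing slash. It is an illustration, not the project's code.

```cpp
#include <string>

// Hypothetical, more defensive version of the head + tail concatenation.
bool BuildUrl(const std::string &src_root, const std::string &file_path, std::string *url)
{
    const std::string url_head = "https://www.boost.org/doc/libs/1_86_0/doc/html";
    // file_path must be longer than src_root and start with it
    if (file_path.size() <= src_root.size() ||
        file_path.compare(0, src_root.size(), src_root) != 0)
        return false;
    std::string tail = file_path.substr(src_root.size());
    if (tail[0] != '/')
        tail.insert(tail.begin(), '/'); // src_root ended with '/', restore it
    *url = url_head + tail;
    return true;
}
```

With this check, both "data/input" and "data/input/" as the root map data/input/accumulators.html to the same url, and a path outside the root is rejected instead of producing a broken link.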

Next we run a test: after one document has been parsed and stored, its content is printed.

parser.cc

#include<iostream>
#include<string>
#include<vector>
#include<boost/filesystem.hpp>
#include"util.hpp"

//一个存放所有html网页的目录路径
const std::string src_path="data/input";
//一个存放有效内容的文件路径
const std::string output="data/raw_html/raw.txt";

//文档内容
typedef struct DocInfo{
    std::string title;  //文档的标题
    std::string content;//文档的内容
    std::string url;    //该文档在官网中的url
}DocInfo_t;


bool EnumFile(const std::string &src_path,std::vector<std::string> *files_list);
bool ParseHtml(const std::vector<std::string> &files_list,std::vector<DocInfo_t> *results);
bool SaveHtml(const std::vector<DocInfo_t> &results,const std::string &output);
//ps:
//const & - 输入型参数
//* - 输出型参数
//& - 输入输出型参数

int main()
{
    //1.递归式地把每个html文件路径，保存到files_list中，
    //  方便后期对文件逐个进行读取
    std::vector<std::string> files_list;//存放文件名
    if(!EnumFile(src_path,&files_list))
    {
        std::cerr<<"enum file name error"<<std::endl;
        return 1;
    }
    //std::cout<<"files_list has "<<files_list.size()<<std::endl;//for debug

    //2.按照files_list读取每个文件的内容，并进行解析
    std::vector<DocInfo_t> results;
    if(!ParseHtml(files_list,&results))
    {
        std::cerr<<"parse html error"<<std::endl;
        return 2;
    }
    // std::cout<<"results has "<<results.size()<<std::endl;//for debug
    // for(auto doc:results)
    // {
    //     std::cout<<"title: "<<doc.title<<std::endl;
    //     std::cout<<"content: "<<doc.content<<std::endl;
    //     std::cout<<"url: "<<doc.url<<std::endl;
    // }

    //3.把解析完的文件内容，全部写入到output
    //  将\3作为每个文档的分隔符
    if(!SaveHtml(results,output))
    {
        std::cerr<<"save html error"<<std::endl;
        return 3;
    }
}


bool EnumFile(const std::string &src_path,std::vector<std::string> *files_list)
{
    namespace fs=boost::filesystem;//这样可以简化作用域的书写
    fs::path root_path(src_path);  // 定义一个path对象，枚举文件就从这个路径下开始
    //判断路径是否存在，不存在则直接返回false
    if(!fs::exists(root_path))      
    {
        std::cerr<<src_path<<"not exists"<<std::endl;
        return false;
    }
    //将存在的、有效的html文档加入files_list
    fs::recursive_directory_iterator end; //定义一个空的迭代器，用于判断递归结束
    for(fs::recursive_directory_iterator iter(root_path);iter!=end;iter++)
    {
        //判断指定路径是不是普通文件，若指定路径是目录或图片则直接跳过
        if(!fs::is_regular_file(*iter))
        {
            continue;
        }
        //如果是普通文件，但不是html文件，也直接跳过
        if(iter->path().extension()!=".html")
        {
            continue;
        }
        //代码走到这里，当前路径一定是一个合法的html文件
        //于是将所有带路径的html，保存在files_list中，方便后续进行文本分析
        //std::cout<<"debug: "<<iter->path().string()<<std::endl;//打印测试
        files_list->push_back(iter->path().string());
    }   
    return true;
}

static bool ParseTitle(const std::string &file,std::string *title) //定义为static，仅在本文件内有效
{   
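    //to extract "document title" from "<title>document title</title>",
    //first find the positions of <title> and </title> in the document,
    //then locate the title text between them
    //and copy it out (ps: the title occupies a half-open range)

    //1.locate <title> and </title>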
    std::size_t begin = file.find("<title>");
    if(begin == std::string::npos){
        return false;
    }
    std::size_t end = file.find("</title>");
    if(end == std::string::npos){
        return false;
    }
    //2.定位文档标题
    begin += std::string("<title>").size(); //文档标题现在的位置：[begin,end)
    //3.提取文档标题
    if(begin > end){
        return false;
    }
    *title = file.substr(begin, end - begin);

    return true;
}
static bool ParseContent(const std::string &file,std::string *content)
{
    //本质就是去标签
    //去标签基于一个简易的状态机来实现

    enum status{ //枚举两种状态
        LABLE,   //标签
        CONTENT  //有效内容
    };
    enum status s=LABLE; //最初默认字符是标签
    for(char c : file)
    {
        switch(s)
        {
            case LABLE:   //当前字符为标签
                if(c=='>') s=CONTENT;
                break;
            case CONTENT: //当前字符为有效内容
                if(c=='<') s=LABLE;
                else {
                    if(c=='\n') c=' ';//后续用\n作为html解析之后文本的分隔符，因此不保留原始文件中的\n,
                    content->push_back(c);
                }
                break;
            default:
                break;
        }
    }
    return true;
}
static bool ParseUrl(const std::string &file_path,std::string *url)
{    
    std::string url_head="https://www.boost.org/doc/libs/1_86_0/doc/html";
    std::string url_tail=file_path.substr(src_path.size());
    *url=url_head+url_tail;
    return true;
}
static void ShowDoc(const DocInfo_t &doc) //for debug
{
    std::cout<<"title: "<<doc.title<<std::endl;
    std::cout<<"content: "<<doc.content<<std::endl;
    std::cout<<"url: "<<doc.url<<std::endl;
}
bool ParseHtml(const std::vector<std::string> &files_list,std::vector<DocInfo_t> *results)
{
    for(const std::string &file:files_list)
    {
        //1.读取文件
        std::string result;
        if(!ns_util::FileUtil::ReadFile(file,&result))
            continue; 
        DocInfo_t doc;
        //2.解析文档并提取title
        if(!ParseTitle(result,&doc.title))
            continue;
        //3.解析文档并提取content(去标签)
        if(!ParseContent(result,&doc.content))
            continue;
        //4.解析文档路径并提取url
        if(!ParseUrl(file,&doc.url))
            continue;

        //代码走到这里，解析任务一定是完成了的
        //当前文档的解析结果都保存在了doc中;

        //5.将解析结果放入 results 中
        results->push_back(doc);
        //results->push_back(std::move(doc));//传入右值，可减少拷贝开销

        ShowDoc(doc);   //for debug
        break;          //for debug
    }
    return true;
}
bool SaveHtml(const std::vector<DocInfo_t> &results,const std::string &output)
{

    return true;
}

Makefile

CC=g++

parser:parser.cc
	$(CC) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem
.PHONY:clean
clean:
	rm -rf parser

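The test prints the title, content and url of the first document that was parsed.

.3- Saving the parsed content to raw.txt

Saving the content parsed by ParseHtml() means filling in SaveHtml().

The purpose is to write everything ParseHtml() produced, that is, the contents of the vector<DocInfo_t> container results, into raw.txt. The parsed fields are joined and separated in a fixed format before being written, which also makes them easy to read back later. The steps are:

Use \n as the separator between documents and \3 as the separator between the title, summary and url of a document, giving title\3content\3url \n title\3content\3url \n title\3content\3url \n ...
Assemble the content of each document in that format and write the documents to raw.txt one by one.

The code is as follows: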

bool SaveHtml(const std::vector<DocInfo_t> &results,const std::string &output)
{   
    #define SEP '\3'

    //1.open raw.txt
    //  binary mode writes the bytes exactly as given, with no newline translation
    std::ofstream out(output,std::ios::out|std::ios::binary);
    if(!out.is_open())
    {
        std::cerr<<"open "<<output<<" failed!"<<std::endl;
        return false;
    }
    //2.write the content
    for(auto &it:results)
    {
        //assemble one record in the form
        //title \3 content \3 url \n
        std::string out_string;
        out_string=it.title;
        out_string+=SEP;
        out_string+=it.content;
        out_string+=SEP;
        out_string+=it.url;
        out_string+='\n';
        //write the assembled record to raw.txt
        out.write(out_string.c_str(),out_string.size());
    }
    //3.close raw.txt
    out.close();

    return true;
}

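【ps】\3 and \4 are non-printable control characters in the ASCII table. Using \3 to separate the title, content and url of a document does not pollute the text, and it lets a later getline(ifstream, line) read a whole document, title\3content\3url, in one call.

The complete code of the Parser module is as follows: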

util.hpp

#pragma once
#include<iostream>
#include<string>
#include<fstream>
namespace ns_util
{
    class FileUtil
    {
        public:
          //reads the whole file at file_path and appends it to *out
          static bool ReadFile(const std::string &file_path, std::string *out)
            {
                std::ifstream in(file_path, std::ios::in);
                if(!in.is_open()){
                    std::cerr << "open file " << file_path << " error" << std::endl;
                    return false;
                }

                std::string line;
                //getline returns a reference to the stream; the loop condition works because
                //the stream converts to bool and becomes false at end of file or on error
                while(std::getline(in, line)){
                    *out += line;
                }

                in.close();
                return true;
            }
    };
}

parser.cc

#include<iostream>
#include<string>
#include<vector>
#include<utility> //std::move
#include<boost/filesystem.hpp>
#include"util.hpp"

//directory that holds all the html pages
const std::string src_path="data/input";
//file that receives the cleaned content
const std::string output="data/raw_html/raw.txt";

//parsed content of one document
typedef struct DocInfo{
    std::string title;  //title of the document
    std::string content;//content of the document
    std::string url;    //url of the document on the official site
}DocInfo_t;


bool EnumFile(const std::string &src_path,std::vector<std::string> *files_list);
bool ParseHtml(const std::vector<std::string> &files_list,std::vector<DocInfo_t> *results);
bool SaveHtml(const std::vector<DocInfo_t> &results,const std::string &output);
//ps:
//const & - input parameter
//* - output parameter
//& - input/output parameter

int main()
{
    //1.recursively collect the path of every html file into files_list,
    //  so the files can be read one by one later
    std::vector<std::string> files_list;//file paths
    if(!EnumFile(src_path,&files_list))
    {
        std::cerr<<"enum file name error"<<std::endl;
        return 1;
    }
    //std::cout<<"files_list has "<<files_list.size()<<std::endl;//for debug

    //2.read each file listed in files_list and parse it
    std::vector<DocInfo_t> results;
    if(!ParseHtml(files_list,&results))
    {
        std::cerr<<"parse html error"<<std::endl;
        return 2;
    }
    // std::cout<<"results has "<<results.size()<<std::endl;//for debug
    // for(auto doc:results)
    // {
    //     std::cout<<"title: "<<doc.title<<std::endl;
    //     std::cout<<"content: "<<doc.content<<std::endl;
    //     std::cout<<"url: "<<doc.url<<std::endl;
    // }

    //3.write all the parsed content to output;
    //  \3 separates the fields of a document, \n separates documents
    if(!SaveHtml(results,output))
    {
        std::cerr<<"save html error"<<std::endl;
        return 3;
    }
}


bool EnumFile(const std::string &src_path,std::vector<std::string> *files_list)
{
    namespace fs=boost::filesystem;//alias that shortens the qualified names
    fs::path root_path(src_path);  //path object; enumeration starts from here
    //return false right away if the path does not exist
    if(!fs::exists(root_path))      
    {
        std::cerr<<src_path<<" not exists"<<std::endl;
        return false;
    }
    //add every existing, valid html document to files_list
    fs::recursive_directory_iterator end; //default-constructed iterator marks the end of the traversal
    for(fs::recursive_directory_iterator iter(root_path);iter!=end;iter++)
    {
        //skip anything that is not a regular file, such as a directory
        if(!fs::is_regular_file(*iter))
        {
            continue;
        }
        //skip regular files that are not html files (images, for example)
        if(iter->path().extension()!=".html")
        {
            continue;
        }
        //at this point the current path is a regular file with the .html extension,
        //so store its full path in files_list for the later text analysis
        //std::cout<<"debug: "<<iter->path().string()<<std::endl;//debug print
        files_list->push_back(iter->path().string());
    }   
    return true;
}

static bool ParseTitle(const std::string &file,std::string *title) //static: visible only in this file
{   
    //to extract "document title" from "<title>document title</title>",
    //first find the positions of <title> and </title> in the document,
    //then locate the title text between them
    //and copy it out (ps: the title occupies a half-open range)

    //1.locate <title> and </title>
    std::size_t begin = file.find("<title>");
    if(begin == std::string::npos){
        return false;
    }
    std::size_t end = file.find("</title>");
    if(end == std::string::npos){
        return false;
    }
    //2.locate the title text
    begin += std::string("<title>").size(); //the title text is now in [begin,end)
    //3.extract the title text
    if(begin > end){
        return false;
    }
    *title = file.substr(begin, end - begin);

    return true;
}
static bool ParseContent(const std::string &file,std::string *content)
{
    //this is tag removal,
    //implemented with a simple two-state state machine

    enum status{ //the two states
        LABLE,   //inside a tag
        CONTENT  //inside useful content
    };
    enum status s=LABLE; //assume the file starts inside a tag
    for(char c : file)
    {
        switch(s)
        {
            case LABLE:   //currently inside a tag
                if(c=='>') s=CONTENT;
                break;
            case CONTENT: //currently inside useful content
                if(c=='<') s=LABLE;
                else {
                    if(c=='\n') c=' ';//'\n' separates documents in raw.txt, so it is not kept inside the content
                    content->push_back(c);
                }
                break;
            default:
                break;
        }
    }
    return true;
}
static bool ParseUrl(const std::string &file_path,std::string *url)
{    
    std::string url_head="https://www.boost.org/doc/libs/1_86_0/doc/html";
    std::string url_tail=file_path.substr(src_path.size());
    *url=url_head+url_tail;
    return true;
}
static void ShowDoc(const DocInfo_t &doc) //for debug
{
    std::cout<<"title: "<<doc.title<<std::endl;
    std::cout<<"content: "<<doc.content<<std::endl;
    std::cout<<"url: "<<doc.url<<std::endl;
}
bool ParseHtml(const std::vector<std::string> &files_list,std::vector<DocInfo_t> *results)
{
    for(const std::string &file:files_list)
    {
        //1.read the file
        std::string result;
        if(!ns_util::FileUtil::ReadFile(file,&result))
            continue; 
        DocInfo_t doc;
        //2.parse the document and extract the title
        if(!ParseTitle(result,&doc.title))
            continue;
        //3.parse the document and extract the content (tag removal)
        if(!ParseContent(result,&doc.content))
            continue;
        //4.parse the document path and extract the url
        if(!ParseUrl(file,&doc.url))
            continue;

        //at this point all three steps succeeded
        //and the parsed result of this document is stored in doc

        //5.put the parsed result into results
        //results->push_back(doc);
        results->push_back(std::move(doc));//pass an rvalue to avoid copying the strings

        //ShowDoc(doc);   //for debug
        //break;          //for debug
    }
    return true;
}
bool SaveHtml(const std::vector<DocInfo_t> &results,const std::string &output)
{   
    #define SEP '\3'

    //1.open raw.txt
    //  binary mode writes the bytes exactly as given, with no newline translation
    std::ofstream out(output,std::ios::out|std::ios::binary);
    if(!out.is_open())
    {
        std::cerr<<"open "<<output<<" failed!"<<std::endl;
        return false;
    }
    //2.write the content
    for(auto &it:results)
    {
        //assemble one record in the form
        //title \3 content \3 url \n
        std::string out_string;
        out_string=it.title;
        out_string+=SEP;
        out_string+=it.content;
        out_string+=SEP;
        out_string+=it.url;
        out_string+='\n';
        //write the assembled record to raw.txt
        out.write(out_string.c_str(),out_string.size());
    }
    //3.close raw.txt
    out.close();

    return true;
}

Makefile

CC=g++

parser:parser.cc
	$(CC) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem
.PHONY:clean
clean:
	rm -rf parser

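After the program is compiled and run, open raw.txt with vim: it contains the cleaned content of the parsed html documents.

IV. Writing the Index Module

With the data cleaned and the tags removed, the next step is to build an index over it. Create an index.hpp file to implement the Index module.

1) Basic code skeleton of Index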

index.hpp

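#pragma once
#include<iostream>
#include<string>
#include<vector>
#include<unordered_map>
#include<cstdint> //uint64_t
namespace ns_index
{
    struct DocInfo //one document
    {
        std::string title;  //document title
        std::string content;//cleaned document content (tags removed)
        std::string url;    //url of the document on the official site
        uint64_t doc_id;    //document ID
    };
    struct InvertedElem //one node of an inverted list
    {
        uint64_t doc_id; //document ID
        std::string word;//keyword
        int weight;      //weight used for ranking
    };
    typedef std::vector<InvertedElem> InvertedList_t; //inverted list

    class Index
    {
    private:
        //the forward index is a vector, because the element index doubles as the document ID
        std::vector<DocInfo> forward_index;
        //the inverted index maps a keyword to its inverted list (a group of InvertedElem)
        std::unordered_map<std::string,InvertedList_t> inverted_index;
    public:
        Index(){}
        ~Index(){}
    public:
        //look up a document by its ID (forward index)
        DocInfo *GetForwardIndex(const uint64_t &doc_id)
        {
            return nullptr; //placeholder, implemented below
        }
        //look up the inverted list of a keyword (inverted index)
        InvertedList_t *GetInvertedList(const std::string &word)
        {
            return nullptr; //placeholder, implemented below
        }
        //build the forward and inverted indexes from the cleaned documents
        bool BuildIndex(const std::string &input)
        {
            return false; //placeholder, implemented below
        }
    };
}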
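The weight field of InvertedElem is described as a ranking weight. As a sketch of how such a field is typically used, the snippet below sorts an inverted list so that the heaviest entries come first. The descending order and the SortByWeight helper are assumptions made for illustration; the source only states that weight is used for sorting. The struct is repeated here so that the example is self-contained.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Same shape as ns_index::InvertedElem, repeated to keep the sketch self-contained.
struct InvertedElem
{
    std::uint64_t doc_id; // document ID
    std::string word;     // keyword
    int weight;           // ranking weight
};
typedef std::vector<InvertedElem> InvertedList_t;

// Hypothetical helper: orders the list from highest to lowest weight (assumed order).
void SortByWeight(InvertedList_t *list)
{
    std::sort(list->begin(), list->end(),
              [](const InvertedElem &a, const InvertedElem &b) {
                  return a.weight > b.weight;
              });
}
```

After sorting, the documents at the front of the list are the ones a search result page would show first.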

2) Looking up and building the index

A forward lookup only has to find the document that belongs to a document ID. The completed code is as follows:

        //find a document by its ID (forward lookup)
        DocInfo *GetForwardIndex(const uint64_t &doc_id)
        {
            //if doc_id is in range, simply return the address of forward_index[doc_id]
            if(doc_id>=forward_index.size())
            {
                std::cerr<<"doc_id out range!"<<std::endl;
                return nullptr;
            }
            return &forward_index[doc_id];
        }

An inverted lookup only has to fetch the inverted list that belongs to a keyword. The completed code is as follows:

        //get the inverted list for a keyword (inverted lookup)
        InvertedList_t *GetInvertedList(const std::string &word)
        {
            //if word is present, simply return the address of its inverted list
            auto it=inverted_index.find(word);
            if(it==inverted_index.end())
            {
                std::cerr<<word<<" have no InvertedList"<<std::endl;
                return nullptr;
            }
            return &(it->second);
        }
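
To make the division of labour between the two getters concrete, here is a minimal sketch that uses only the standard library. The names Doc, Hit, TinyIndex and Lookup are hypothetical stand-ins chosen for illustration and are not part of index.hpp; the sketch only assumes the same idea as above, namely a vector for the forward index and a hash map for the inverted index.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical miniature of the two structures in index.hpp.
struct Doc { std::string title; std::uint64_t doc_id; };
struct Hit { std::uint64_t doc_id; int weight; };

struct TinyIndex {
    std::vector<Doc> forward;                                    // doc_id -> document
    std::unordered_map<std::string, std::vector<Hit>> inverted;  // word -> hits

    // Forward lookup: bounds-check, then index into the vector.
    const Doc* GetForward(std::uint64_t id) const {
        return id < forward.size() ? &forward[id] : nullptr;
    }
    // Inverted lookup: find() never inserts a key, unlike operator[].
    const std::vector<Hit>* GetInverted(const std::string& word) const {
        auto it = inverted.find(word);
        return it == inverted.end() ? nullptr : &it->second;
    }
};

// Resolve a word to the titles of the documents that contain it:
// inverted lookup first, then one forward lookup per hit.
std::vector<std::string> Lookup(const TinyIndex& idx, const std::string& word) {
    std::vector<std::string> titles;
    if (const auto* hits = idx.GetInverted(word))
        for (const Hit& h : *hits)
            if (const Doc* d = idx.GetForward(h.doc_id))
                titles.push_back(d->title);
    return titles;
}
```

Using find() rather than operator[] in the inverted lookup matters: operator[] would silently insert an empty list for every unknown word that is searched for.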

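To build the index, open raw.txt, which holds the useful content of the html documents, and read it line by line (each document was stored in raw.txt earlier in the form title\3content\3url \n), building the forward index and then the inverted index for every line.

When a user searches with a keyword, our site search engine first uses the inverted index to go from the keyword to document IDs, then uses the forward index to go from each document ID to the document content. Building the index runs in the opposite order: the forward index first, then the inverted index. The reason is simple: until a document has been added to the forward index it has no document ID, so there is nothing for the inverted index to point to.

The completed code is as follows: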

        DocInfo *BuildForwardIndex(const std::string &line) //build the forward index
        {}
        bool BuildInvertedIndex(const DocInfo &doc)         //build the inverted index
        {}
        bool BuildIndex(const std::string &input)
        {
            //1. open the file
            //   raw.txt was written in binary mode, so read it in binary mode as well
            std::ifstream in(input,std::ios::in | std::ios::binary);
            if(!in.is_open())
            {
                std::cerr<<input<<" open error!"<<std::endl;
                return false;
            }
            //2. read line by line
            std::string line;
            while(std::getline(in,line))
            {
                DocInfo* doc=BuildForwardIndex(line);
                if(doc==nullptr)
                {
                    std::cerr<<"build: "<<line<<" error"<<std::endl;//for debug
                    continue;
                }
                BuildInvertedIndex(*doc);
            }
            //3. close the file
            in.close();

            return true;
        }

At this point, the complete code of the Index module is as follows:

index.hpp

#pragma once
#include<cstdint>
#include<iostream>
#include<string>
#include<vector>
#include<unordered_map>
#include<fstream>
namespace ns_index
{
    struct DocInfo //document information
    {
        std::string title;  //document title
        std::string content;//useful document content (tags removed)
        std::string url;    //url of the document on the official site
        uint64_t doc_id;    //document ID
    };
    struct InvertedElem //node of an inverted list
    {
        uint64_t doc_id; //document ID
        std::string word;//keyword
        int weight;      //weight used for ranking
    };
    typedef std::vector<InvertedElem> InvertedList_t; //inverted list

    class Index //the index
    {
    private:
        //the forward index is stored in an array, whose subscript naturally serves as the document ID
        std::vector<DocInfo> forward_index;
        //the inverted index maps a keyword to a group of InvertedElem (an inverted list)
        std::unordered_map<std::string,InvertedList_t> inverted_index;
    public:
        Index(){}
        ~Index(){}
    public:
        //find a document by its ID (forward lookup)
        DocInfo *GetForwardIndex(const uint64_t &doc_id)
        {
            //if doc_id is in range, simply return the address of forward_index[doc_id]
            if(doc_id>=forward_index.size())
            {
                std::cerr<<"doc_id out range!"<<std::endl;
                return nullptr;
            }
            return &forward_index[doc_id];
        }
        //get the inverted list for a keyword (inverted lookup)
        InvertedList_t *GetInvertedList(const std::string &word)
        {
            //if word is present, simply return the address of its inverted list
            auto it=inverted_index.find(word);
            if(it==inverted_index.end())
            {
                std::cerr<<word<<" have no InvertedList"<<std::endl;
                return nullptr;
            }
            return &(it->second);
        }
        //build the forward and inverted indexes from the cleaned document data
        bool BuildIndex(const std::string &input)
        {
            //1. open the file
            //   raw.txt was written in binary mode, so read it in binary mode as well
            std::ifstream in(input,std::ios::in | std::ios::binary);
            if(!in.is_open())
            {
                std::cerr<<input<<" open error!"<<std::endl;
                return false;
            }
            //2. read line by line
            std::string line;
            while(std::getline(in,line))
            {
                DocInfo* doc=BuildForwardIndex(line); //forward index first
                if(doc==nullptr)
                {
                    std::cerr<<"build: "<<line<<" error"<<std::endl;//for debug
                    continue;
                }
                BuildInvertedIndex(*doc);            //then the inverted index
            }
            //3. close the file
            in.close();

            return true;
        }
    private:
        //build the forward index
        DocInfo *BuildForwardIndex(const std::string &line)
        {}
        //build the inverted index
        bool BuildInvertedIndex(const DocInfo &doc)
        {}
    };
}

3) Building the forward index

Building the forward index means filling in BuildForwardIndex() from the listing above. The code is as follows:

        //build the forward index
        DocInfo *BuildForwardIndex(const std::string &line)
        {
            //1. split line to extract title, content and url
            std::vector<std::string> results;
            const std::string sep="\3";
            ns_util::StringUtil::Split(line,&results,sep);
            if(results.size()!=3)
                return nullptr;
            //2. fill a DocInfo object with the pieces
            DocInfo doc;
            doc.title=results[0];
            doc.content=results[1];
            doc.url=results[2];
            doc.doc_id=forward_index.size(); //set doc_id before inserting: the current size() is exactly the subscript the new element gets after push_back
            //3. append the DocInfo object to the forward-index vector
            forward_index.push_back(doc);

            return &forward_index.back();//return the address of the last element, i.e. the DocInfo just inserted
        }
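
The "read size() first, then push_back" trick can be checked in isolation. The following is a small sketch under stated assumptions: Doc and AppendDoc are hypothetical names, and the function returns the ID instead of a pointer, which is a variation for illustration rather than what index.hpp does.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, trimmed-down document record.
struct Doc { std::string title; std::uint64_t doc_id; };

// Append a document and return its ID. Reading size() *before* push_back
// yields the subscript the new element will occupy afterwards.
std::uint64_t AppendDoc(std::vector<Doc>& forward, const std::string& title) {
    Doc doc;
    doc.title = title;
    doc.doc_id = forward.size();
    forward.push_back(std::move(doc));
    return forward.back().doc_id;
}
```

One point worth keeping in mind: the address returned by &forward_index.back() is only guaranteed to stay valid until a later push_back makes the vector reallocate. BuildIndex() uses the pointer immediately, before the next insertion, so this is safe there; code that needs to remember a document for longer can store the ID instead.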

The string-splitting function used here is implemented in util.hpp, the file that holds our utility classes:

util.hpp

#pragma once
#include<iostream>
#include<vector>
#include<string>
#include<fstream>
#include<boost/algorithm/string.hpp>
namespace ns_util
{
    //...

    class StringUtil
    {
    public:
        static void Split(const std::string &target,std::vector<std::string> *out,const std::string &sep)
        {
            //use split() from the Boost library to do the splitting
            boost::split(*out,target,boost::is_any_of(sep),boost::token_compress_on);
            //1st argument: where to store the pieces
            //2nd argument: the string to split
            //3rd argument: the set of separator characters, whether one or several
            //4th argument: optional; the default, token_compress_off, does not merge adjacent separators
            //              (so the empty strings between them are kept);
            //              token_compress_on merges a run of adjacent separators into one.
        }
    };
}
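
The effect of the fourth argument is easy to miss, so here is a hand-written sketch of the two behaviours. SplitFields is a hypothetical stand-in written only for illustration; it is not Boost's implementation, and it only models the "merge adjacent separators" part of token compression.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a separator-based split (not Boost's code).
// With compress == true, a run of adjacent separators counts as one.
std::vector<std::string> SplitFields(const std::string& line, char sep, bool compress) {
    std::vector<std::string> out;
    std::string field;
    bool prev_was_sep = false;
    for (char c : line) {
        if (c == sep) {
            // Close the current field unless we are compressing a run of separators.
            if (!(compress && prev_was_sep)) {
                out.push_back(field);
                field.clear();
            }
            prev_was_sep = true;
        } else {
            field.push_back(c);
            prev_was_sep = false;
        }
    }
    out.push_back(field);  // the last field has no trailing separator
    return out;
}
```

For a record such as title\3\3url, whose content field is empty, the non-compressing mode yields three fields (the middle one empty), while the compressing mode yields only two. With token_compress_on, BuildForwardIndex() would therefore reject such a record through its results.size()!=3 check, which in effect skips documents without content.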

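4) Building the inverted index

.1- Steps and rationale

Suppose there is a document whose ID is 123, whose content is “吃葡萄不吐葡萄皮”, and whose url is “https://xxxxxx”. After the forward index has been built for this document, its information should look as follows:

The principle of the inverted index is to find document IDs by keyword. To build it, we therefore turn the text of each document (title + content) into one or more InvertedElem nodes and append each node to the inverted list of its keyword, so that a keyword can later be used to fetch its inverted list.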

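    struct DocInfo //document information
    {
        std::string title;  //document title
        std::string content;//useful document content (tags removed)
        std::string url;    //url of the document on the official site
        uint64_t doc_id;    //document ID
    };

    struct InvertedElem //node of an inverted list
    {
        uint64_t doc_id; //document ID
        std::string word;//keyword
        int weight;      //weight used for ranking
    };

    typedef std::vector<InvertedElem> InvertedList_t; //inverted list

    //the inverted index maps a keyword to a group of InvertedElem (an inverted list)
    std::unordered_map<std::string,InvertedList_t> inverted_index;
    //underneath, it is a hash table with buckets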

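Because one document contains many words, all of which must map back to that document's ID, the title and content of each document in the forward index have to be segmented into words. Each resulting keyword gets an InvertedElem in that keyword's inverted list (a vector), and an unordered_map ties every keyword to its list and thereby to the document IDs.

【Tips】Steps for building the inverted index, with pseudocode

Word segmentation: use cppjieba to segment the document's title and content.
Word-frequency counting: word frequency expresses how relevant a word is to a document. Documents in which a keyword appears more often are more relevant and should be shown earlier, which is why some documents appear at the top of the search results for a keyword while others appear at the bottom.
Custom relevance: a word that appears in the title can generally be considered more relevant than a word that appears in the content. We therefore give words in the title a higher weight and words in the content a lower weight.

Taking the document whose ID is 123, whose content is “吃葡萄不吐葡萄皮” and whose url is “https://xxxxxx” as an example, we get the following pseudocode: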

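1. Segment title and content with jieba

title: 吃/葡萄/吃葡萄(title_word)
content: 吃/葡萄/不吐/葡萄皮(content_word)

2. Count word frequencies

Use a struct to record how many times each word appears in the title and in the content.

//node for word-frequency counting
struct word_cnt
{
    int title_cnt;  //occurrences of the word in the title
    int content_cnt;//occurrences of the word in the content
};

Use an unordered_map to associate keywords with their frequencies, so that every word in the document has its own word_cnt struct.

//map each keyword to its counts
unordered_map<std::string, word_cnt> word_map;

//count the words in the title
for(auto& word : title_word)
{
    //occurrences of a keyword in the title
    word_map[word].title_cnt++; //吃（1）/葡萄（1）/吃葡萄（1）
}
//count the words in the content
for(auto& word : content_word)
{
    //occurrences of a keyword in the content
    word_map[word].content_cnt++; //吃（1）/葡萄（1）/不吐（1）/葡萄皮（1）
}

3. Custom relevance

Once we know how often each word occurs in the title and in the content, we can fill in the weight field of the inverted-list node. In particular, occurrences in the title count for more than occurrences in the content.

We also fill in the node's other fields and append the finished node to the inverted list.

//walk the keyword-to-counts map, fill in a node for each keyword and append it to the inverted list
for(auto& word : word_map)
{
    struct InvertedElem elem;//define a node, then fill in its fields
    elem.doc_id = 123;       //document ID
    elem.word = word.first;  //keyword
    elem.weight = 10*word.second.title_cnt + word.second.content_cnt ; //compute the keyword's weight
    inverted_index[word.first].push_back(elem); //append the finished node to the inverted list
}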
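
The pseudocode above can be turned into a small runnable sketch. It assumes the words have already been segmented (so no jieba is involved), and WordCount and ComputeWeights are hypothetical names used only for this illustration; the 10:1 ratio is the one from the pseudocode.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Per-word counters; both start at zero.
struct WordCount { int title_cnt = 0; int content_cnt = 0; };

// Count occurrences in title and content, then weight title hits 10x.
// The word lists are assumed to be already segmented.
std::map<std::string, int> ComputeWeights(const std::vector<std::string>& title_words,
                                          const std::vector<std::string>& content_words) {
    std::map<std::string, WordCount> counts;
    for (const auto& w : title_words)   ++counts[w].title_cnt;
    for (const auto& w : content_words) ++counts[w].content_cnt;

    std::map<std::string, int> weights;
    for (const auto& kv : counts)
        weights[kv.first] = 10 * kv.second.title_cnt + kv.second.content_cnt;
    return weights;
}
```

With the example document, 吃 and 葡萄 appear once in the title and once in the content and get weight 11, 吃葡萄 appears only in the title and gets 10, while 不吐 and 葡萄皮 appear only in the content and get 1.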
 

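.2- Installing and using cppjieba

cppjieba is a word-segmentation tool that has to be downloaded separately.

【Tips】Getting cppjieba

Go to https://gitcode.net and type cppjieba into the search box:

Click the following link in the search results:

Scroll down the page that opens to find the introduction to cppjieba:

CppJieba is the C++ version of the "Jieba" Chinese word segmenter. All of its source code lives in the header files include/cppjieba/*.hpp, so it can be used simply by including them.

So the only header we need to include directly is Jieba.hpp.

Select "Clone" and copy the link:

Then create a test directory and, inside it, clone the repository locally with “git clone + link”:

【Tips】A demonstration of cppjieba

In the cloned cppjieba directory, cppjieba/include/cppjieba/Jieba.hpp is the header we actually need, and cppjieba/test/demo.cpp is a file of usage examples.

Enter the cppjieba/test directory and view demo.cpp with vim; its content is as follows:

We now show how to get demo.cpp to compile and run.

Copy demo.cpp into the directory that contains cppjieba, i.e. Boost_Seacher/test:

Create a symbolic link named dict to cppjieba/dict, so that demo.cpp can find the dictionaries under cppjieba/dict that segmentation relies on:

Create a symbolic link named inc to cppjieba/include, so that demo.cpp can find the header Jieba.hpp under cppjieba/include/cppjieba:

Open demo.cpp with vim, update the paths of the included header and the dictionaries, and delete part of the main function so that only the code related to CutForSearch() remains:

In particular, the cppjieba/deps/limonp directory must be copied into cppjieba/include/cppjieba (mainly for the Logging.hpp file); otherwise compilation fails:

Now compile demo.cpp into an executable named demo with g++ and run it to see the segmentation in action:

【Tips】Bringing cppjieba into the project

Continuing from above: we have cloned cppjieba into Boost_Seacher/test and compiled and run demo.cpp to demonstrate segmentation. Now we bring cppjieba into our project.

First, in the Boost_Seacher directory, create a symbolic link named dict to test/cppjieba/dict:

Then create a symbolic link named cppjieba to test/cppjieba/include/cppjieba:

Next, we put the segmentation code into util.hpp, the file that holds our utility classes.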

util.hpp

#pragma once
#include<iostream>
#include<vector>
#include<string>
#include<fstream>
#include<boost/algorithm/string.hpp>
#include"cppjieba/Jieba.hpp" //bring in cppjieba
namespace ns_util
{

    //...


    //dictionary paths:
    const char* const DICT_PATH = "./dict/jieba.dict.utf8";
    const char* const HMM_PATH = "./dict/hmm_model.utf8";
    const char* const USER_DICT_PATH = "./dict/user.dict.utf8";
    const char* const IDF_PATH = "./dict/idf.utf8";
    const char* const STOP_WORD_PATH = "./dict/stop_words.utf8";
    class JiebaUtil
    {
    private:
        static cppjieba::Jieba jieba;
        //the member is static so that it is not re-initialized on every external call to CutString()
        //note: a static data member must be defined outside the class

    public:
        //1. a static member function can be called from outside without an object
        //2. a static member function has no this pointer, so the jieba object it uses must be a static member too
        static void CutString(const std::string &src,std::vector<std::string> *out)
        {
            jieba.CutForSearch(src,*out);
        }
    };
    cppjieba::Jieba JiebaUtil::jieba(
        DICT_PATH,
        HMM_PATH,
        USER_DICT_PATH,
        IDF_PATH,
        STOP_WORD_PATH);
}

.3- Writing the inverted-index code

        //build the inverted index
        bool BuildInvertedIndex(const DocInfo &doc)
        {
            struct word_cnt //word-frequency counters
            {
                int title_cnt;
                int content_cnt;
                word_cnt():title_cnt(0),content_cnt(0){}
            };
            std::unordered_map<std::string,word_cnt> word_map;//<keyword, counts>
            //segment the title and count word frequencies
            std::vector<std::string> title_words;
            ns_util::JiebaUtil::CutString(doc.title,&title_words);
            for(auto &s:title_words)
            {
                boost::to_lower(s); //normalize to lower case, because searches are usually case-insensitive: hello, Hello and HELLO should all return the same results
                word_map[s].title_cnt++;//operator[] updates the value of an existing key s; for a missing key it first inserts s with a default value, then updates it
            }
            //segment the content and count word frequencies
            std::vector<std::string> content_words;
            ns_util::JiebaUtil::CutString(doc.content,&content_words);
            for(auto &s:content_words)
            {
                boost::to_lower(s); //normalize to lower case
                word_map[s].content_cnt++;//same operator[] behavior as above
            }
            //custom relevance + fill the inverted-list nodes
            #define X 10
            #define Y 1
            for(auto& word_pair:word_map)
            {
                //fill the fields
                InvertedElem item;
                item.doc_id=doc.doc_id;
                item.word=word_pair.first;
                item.weight=X*word_pair.second.title_cnt+Y*word_pair.second.content_cnt;
                //append to the inverted list
                InvertedList_t &inverted_list=inverted_index[word_pair.first];
                inverted_list.push_back(std::move(item));
            }

            return true;
        }
    };

The index-building module Index is now essentially complete. The full code is as follows:

index.hpp

#pragma once
#include<cstdint>
#include<iostream>
#include<string>
#include<vector>
#include<unordered_map>
#include<fstream>
#include"util.hpp"
namespace ns_index
{
    struct DocInfo //document information
    {
        std::string title;  //document title
        std::string content;//useful document content (tags removed)
        std::string url;    //url of the document on the official site
        uint64_t doc_id;    //document ID
    };
    struct InvertedElem //node of an inverted list
    {
        uint64_t doc_id; //document ID
        std::string word;//keyword
        int weight;      //weight used for ranking
    };
    typedef std::vector<InvertedElem> InvertedList_t; //inverted list

    class Index //the index
    {
    private:
        //the forward index is stored in an array, whose subscript naturally serves as the document ID
        std::vector<DocInfo> forward_index;
        //the inverted index maps a keyword to a group of InvertedElem (an inverted list)
        std::unordered_map<std::string,InvertedList_t> inverted_index;
    public:
        Index(){}
        ~Index(){}
    public:
        //find a document by its ID (forward lookup)
        DocInfo *GetForwardIndex(const uint64_t &doc_id)
        {
            //if doc_id is in range, simply return the address of forward_index[doc_id]
            if(doc_id>=forward_index.size())
            {
                std::cerr<<"doc_id out range!"<<std::endl;
                return nullptr;
            }
            return &forward_index[doc_id];
        }
        //get the inverted list for a keyword (inverted lookup)
        InvertedList_t *GetInvertedList(const std::string &word)
        {
            //if word is present, simply return the address of its inverted list
            auto it=inverted_index.find(word);
            if(it==inverted_index.end())
            {
                std::cerr<<word<<" have no InvertedList"<<std::endl;
                return nullptr;
            }
            return &(it->second);
        }
        //build the forward and inverted indexes from the cleaned document data
        bool BuildIndex(const std::string &input)
        {
            //1. open the file
            //   raw.txt was written in binary mode, so read it in binary mode as well
            std::ifstream in(input,std::ios::in | std::ios::binary);
            if(!in.is_open())
            {
                std::cerr<<input<<" open error!"<<std::endl;
                return false;
            }
            //2. read line by line
            std::string line;
            int count=0; //for debug
            while(std::getline(in,line))
            {
                DocInfo* doc=BuildForwardIndex(line); //forward index first
                if(doc==nullptr)
                {
                    std::cerr<<"build: "<<line<<" error"<<std::endl;//for debug
                    continue;
                }
                BuildInvertedIndex(*doc);            //then the inverted index

                //for debug
                count++;
                if(count%1000==0)
                    std::cout<<"documents indexed so far: "<< count << std::endl;
            }
            //3. close the file
            in.close();

            return true;
        }
    private:
        //build the forward index
        DocInfo *BuildForwardIndex(const std::string &line)
        {
            //1. split line to extract title, content and url
            std::vector<std::string> results;
            const std::string sep="\3";
            ns_util::StringUtil::Split(line,&results,sep);
            if(results.size()!=3)
                return nullptr;
            //2. fill a DocInfo object with the pieces
            DocInfo doc;
            doc.title=results[0];
            doc.content=results[1];
            doc.url=results[2];
            doc.doc_id=forward_index.size(); //set doc_id before inserting: the current size() is exactly the subscript the new element gets after push_back
            //3. append the DocInfo object to the forward-index vector
            forward_index.push_back(doc);

            return &forward_index.back();//return the address of the last element, i.e. the DocInfo just inserted
        }
        //build the inverted index
        bool BuildInvertedIndex(const DocInfo &doc)
        {
            struct word_cnt //word-frequency counters
            {
                int title_cnt;
                int content_cnt;
                word_cnt():title_cnt(0),content_cnt(0){}
            };
            std::unordered_map<std::string,word_cnt> word_map;//<keyword, counts>
            //segment the title and count word frequencies
            std::vector<std::string> title_words;
            ns_util::JiebaUtil::CutString(doc.title,&title_words);
            for(auto &s:title_words)
            {
                boost::to_lower(s); //normalize to lower case, because searches are usually case-insensitive: hello, Hello and HELLO should all return the same results
                word_map[s].title_cnt++;//operator[] updates the value of an existing key s; for a missing key it first inserts s with a default value, then updates it
            }
            //segment the content and count word frequencies
            std::vector<std::string> content_words;
            ns_util::JiebaUtil::CutString(doc.content,&content_words);
            for(auto &s:content_words)
            {
                boost::to_lower(s); //normalize to lower case
                word_map[s].content_cnt++;//same operator[] behavior as above
            }
            //custom relevance + fill the inverted-list nodes
            #define X 10
            #define Y 1
            for(auto& word_pair:word_map)
            {
                //fill the fields
                InvertedElem item;
                item.doc_id=doc.doc_id;
                item.word=word_pair.first;
                item.weight=X*word_pair.second.title_cnt+Y*word_pair.second.content_cnt;
                //append to the inverted list
                InvertedList_t &inverted_list=inverted_index[word_pair.first];
                inverted_list.push_back(std::move(item));
            }

            return true;
        }
    };
}

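V. Writing the search module: Searcher

So far we have cleaned the data, stripped its tags, and built an index over the preprocessed result. The next step is to search using that index, so here we write a module that provides the search function, Searcher.

First, create a Searcher.hpp header file in our project directory Boost_Seacher, then write the search module's code in it.

1) The basic code skeleton of Searcher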

Searcher.hpp

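#include"index.hpp"

namespace ns_searcher
{
    class Searcher
    {
    private:
        ns_index::Index *index;//the index that searches run against
    public:
        Searcher(){}
        ~Searcher(){}
    public:
        //initialize the Searcher module
        void InitSearcher(const std::string &input)
        {
            //1. get or create the singleton index object (this avoids building and storing the index more than once)

            //2. build the index through the index object

        }

        //search for a keyword
        //  query: the search keyword(s)
        //  json_string: the search result returned to the user's browser
        void Search(const std::string &query,std::string *json_string)
        {
            //1. [segment]: segment query into words, as the searcher requires

            //2. [trigger]: look up each of the resulting words in the index

            //3. [merge and sort]: collect the results and sort them by relevance (weight) in descending order

            //4. [build]: use jsoncpp to build a json string from the results

        }
    };
}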

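2) Creating a singleton index object

Singleton is a design pattern that guarantees a class has only one instance in the system and provides a global access point to it, so that the instance is shared by every module of the program.

When our server builds the index, what actually happens is that an Index object is created and its member functions are called to do the work. Building the forward and inverted indexes means loading the data from disk into memory; with a large data set this already consumes a lot of memory, and if several Index objects each built their own index the cost would become excessive.

It is therefore well worth designing the Index class as a singleton. A singleton index object guarantees that only one index is ever built, which greatly reduces memory use and makes the program more efficient.

【Tips】Key points of implementing the singleton pattern:

Because only one object may exist globally, the constructor must be private;
Manage the singleton object through a static pointer (a static data member of the class, defined outside the class), and provide a static member function that returns this pointer;
Disable copying so that only one singleton object exists globally;
A mutex can be used to keep creation of the instance thread-safe.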

index.hpp

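#pragma once
#include<cstdint>
#include<iostream>
#include<string>
#include<vector>
#include<unordered_map>
#include<fstream>
#include"util.hpp"
#include<mutex> //C++ mutex
namespace ns_index
{
    struct DocInfo //document information
    {
        std::string title;  //document title
        std::string content;//useful document content (tags removed)
        std::string url;    //url of the document on the official site
        uint64_t doc_id;    //document ID
    };
    struct InvertedElem //node of an inverted list
    {
        uint64_t doc_id; //document ID
        std::string word;//keyword
        int weight;      //weight used for ranking
    };
    typedef std::vector<InvertedElem> InvertedList_t; //inverted list

    class Index //the index
    {
    private:
        //the forward index is stored in an array, whose subscript naturally serves as the document ID
        std::vector<DocInfo> forward_index;
        //the inverted index maps a keyword to a group of InvertedElem (an inverted list)
        std::unordered_map<std::string,InvertedList_t> inverted_index;

    private:
        Index(){}                             //private constructor
        Index(const Index&)=delete;           //no copy construction
        Index& operator=(const Index&)=delete;//no copy assignment
        static Index* instance;               //static pointer to the single instance (defined outside the class)
        static std::mutex mtx;                //mutex that guards creation of the instance
    public:
        ~Index(){}
    public:
        //get the Index object, creating it on first use
        static Index* GetInstance()
        {
            if(instance==nullptr) //no singleton yet: enter the critical section to create one
            {
                mtx.lock();  //lock
                if(instance == nullptr) //still no singleton: create it with new
                    instance=new Index();
                mtx.unlock();//unlock
            }
            return instance;
        }


        //lookup and index-building members (see above)
        //...

    };
    Index* Index::instance=nullptr;//out-of-class definition of the static pointer
    std::mutex Index::mtx;         //the static mutex needs an out-of-class definition as well
}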
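
As a point of comparison, a function-local static is another common way to write a singleton in C++11 and later. The sketch below is an alternative shown for illustration, not the approach taken in index.hpp, and Registry is a hypothetical class name.

```cpp
#include <cassert>

// Hypothetical example class; not the Index from index.hpp.
class Registry {
public:
    // The function-local static is constructed on first use; since C++11
    // the language guarantees that this initialization is thread-safe.
    static Registry& Instance() {
        static Registry instance;
        return instance;
    }
    Registry(const Registry&) = delete;             // no copy construction
    Registry& operator=(const Registry&) = delete;  // no copy assignment

    int counter = 0;  // some shared state, for demonstration only
private:
    Registry() = default;  // only Instance() may construct the object
};
```

This form needs neither an explicit mutex nor out-of-class definitions of static members. The pointer-plus-mutex version above makes the locking visible, which is useful when learning the pattern, but its first check reads the pointer outside the lock, so the local-static form may be the simpler choice where the compiler supports C++11.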

We can now use the singleton index object in the Searcher module; in other words, it is time to fill in InitSearcher().

Searcher.hpp

#include"index.hpp"

namespace ns_searcher
{
    class Searcher
    {
    private:
        ns_index::Index *index;//the index that searches run against
    public:
        Searcher(){}
        ~Searcher(){}
    public:
        void InitSearcher(const std::string &input)
        {
            //1. get or create the index object
            index = ns_index::Index::GetInstance();
            //2. build the index through the index object
            index->BuildIndex(input);
        }

        //...

    };
}

3) Installing and using jsoncpp

jsoncpp is a third-party library for serialization and deserialization. It lets us build a json string from the search results and gives the server a well-defined format for sending data to the client.

【ps】Install jsoncpp with sudo yum install -y jsoncpp-devel as a normal user, or with yum install -y jsoncpp-devel as root.

【ps】To install jsoncpp on ubuntu: sudo apt-get install -y libjsoncpp-dev.

We now demonstrate how jsoncpp is used.

Create a test_json.cc file in the Boost_Searcher/test directory for the demonstration, with the following code:

#include<iostream>
#include<string>
#include<vector>
#include<jsoncpp/json/json.h>

//Value:  the intermediate type that both serialization and deserialization work with
//Writer: serialization (Value -> string)
//Reader: deserialization (string -> Value)
int main()
{
    Json::Value root;
    Json::Value item1;
    item1["key1"]="value11";
    item1["key2"]="value22";

    Json::Value item2;
    item2["key1"]="value1";
    item2["key2"]="value2";

    root.append(item1);
    root.append(item2);

    Json::StyledWriter writer;
    std::string s=writer.write(root);
    std::cout<< s <<std::endl;
}

Compile the program with g++ test_json.cc -o test_json -ljsoncpp and run it:

4) Writing the search function

Writing the search function means filling in Search() in the Searcher module.

Searcher.hpp

#pragma once
#include"index.hpp"
#include"util.hpp"
#include<algorithm>
#include<jsoncpp/json/json.h>
namespace ns_searcher
{
    class Searcher
    {
    private:
        ns_index::Index *index;//the index that searches run against
    public:
        Searcher(){}
        ~Searcher(){}
    public:
        //initialize the Searcher module
        void InitSearcher(const std::string &input)
        {
            //1. get or create the index object
            index = ns_index::Index::GetInstance();
            std::cout<<"singleton obtained..."<<std::endl; //for debug
            //2. build the index through the index object
            index->BuildIndex(input);
            std::cout<<"index built..."<<std::endl; //for debug
        }


        //search for a keyword
        //  query: the search keyword(s)
        //  json_string: the search result returned to the user's browser
        void Search(const std::string &query,std::string *json_string)
        {
            //1. [segment]: segment query into words, as the searcher requires
            std::vector<std::string> words;
            ns_util::JiebaUtil::CutString(query,&words);
            //2. [trigger]: look up each of the resulting words in the index
            ns_index::InvertedList_t inverted_list_all; //holds the results of all inverted lookups
            for(std::string word:words)
            {
                boost::to_lower(word);//as when building the index, normalize to lower case when searching
                //inverted lookup first
                ns_index::InvertedList_t *inverted_list=index->GetInvertedList(word);
                if(inverted_list==nullptr)
                    continue; //nothing found for this word: move on to the next one
                inverted_list_all.insert(inverted_list_all.end(),inverted_list->begin(),inverted_list->end()); //found: append the whole list to inverted_list_all

            }
            //3. [merge and sort]: collect the results and sort them by relevance (weight) in descending order
            std::sort(inverted_list_all.begin(),inverted_list_all.end(),
                [](const ns_index::InvertedElem &e1,const ns_index::InvertedElem &e2){
                    return e1.weight>e2.weight;
                });
            //4. [build]: use jsoncpp to build a json string from the results
            Json::Value root;
            //forward lookup
            for(ns_index::InvertedElem &item : inverted_list_all)
            {
                ns_index::DocInfo *doc=index->GetForwardIndex(item.doc_id);
                if(doc==nullptr)
                    continue; //nothing found for this ID: move on to the next one

                Json::Value elem;
                elem["title"]=doc->title;
                elem["desc"]=doc->content;
                elem["url"]=doc->url;

                //for debug
                elem["id"]=(int)item.doc_id;
                elem["weight"]=item.weight;

                root.append(elem);
            }
            Json::StyledWriter writer;
            *json_string=writer.write(root);
        }
    };
}
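
Steps 2 and 3 of Search() can be tried out without jieba or jsoncpp. The following is a minimal sketch using only the standard library; Hit and MergeAndRank are hypothetical names, and the per-word hit lists are assumed to be given.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical counterpart of InvertedElem.
struct Hit { std::uint64_t doc_id; std::string word; int weight; };

// Concatenate the per-word hit lists and sort by weight, highest first.
std::vector<Hit> MergeAndRank(const std::vector<std::vector<Hit>>& lists) {
    std::vector<Hit> all;
    for (const auto& l : lists)
        all.insert(all.end(), l.begin(), l.end());
    std::sort(all.begin(), all.end(),
              [](const Hit& a, const Hit& b) { return a.weight > b.weight; });
    return all;
}
```

Note what plain concatenation implies: if a query is segmented into two words and both occur in the same document, that document contributes one hit per word and therefore shows up twice in the merged list. One possible refinement, not shown here, would be to accumulate the weights per doc_id before sorting.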

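5) Writing test code

We can now create a debug.cc source file in the project directory Boost_Seacher and write some code in it to test the modules written above.

First note the path of the file that holds our preprocessed data: data/raw_html/raw.txt.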

debug.cc

#include"Searcher.hpp"
#include<cstdio>
#include<cstring>
#include<iostream>
#include<string>
const std::string input="data/raw_html/raw.txt";

int main()
{
    ns_searcher::Searcher *searcher = new ns_searcher::Searcher();
    searcher->InitSearcher(input);
    std::string query;
    std::string json_string;
    char buffer[1024];
    while(1)
    {
        std::cout<<"Please Enter your Search Query"<<std::endl;
        if(fgets(buffer,sizeof(buffer),stdin)==nullptr)
            break; //stop on end of input or a read error
        buffer[strcspn(buffer,"\n")]=0;//strip the trailing "\n", if there is one
        query=buffer;
        searcher->Search(query,&json_string);
        std::cout<< json_string <<std::endl;
    }

    delete searcher;
    return 0;
}

Makefile

PARSER=parser
DUG=debug
CC=g++

.PHONY:all
all:$(PARSER) $(DUG)

$(PARSER):parser.cc
	$(CC) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem
$(DUG):debug.cc
	$(CC) -o $@ $^ -std=c++11 -ljsoncpp
.PHONY:clean
clean:
	rm -rf $(PARSER) $(DUG)

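After compiling, the program runs as follows:

Type the keyword split and press Enter to see the search results:

Although the results show each page's title, content and url, the page content is huge, redundant and unpleasant to read.

In fact we do not need that much of the page content; a short summary is enough. So next we write a function that extracts a summary of the page content.

6) Getting a page summary

The summary function is written as a member function in Searcher.hpp.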

Searcher.hpp

#pragma once

#include"index.hpp"
#include"util.hpp"
#include<algorithm>
#include<cctype>
#include<jsoncpp/json/json.h>
namespace ns_searcher
{
    class Searcher
    {
    private:
        ns_index::Index *index;//the index that searches run against
    public:
        Searcher(){}
        ~Searcher(){}
    public:
        //initialize the Searcher module
        //...

        //search for a keyword
        void Search(const std::string &query,std::string *json_string)
        {
            //1. [segment]: segment query into words, as the searcher requires
            //...

            //2. [trigger]: look up each of the resulting words in the index
            //...

            //3. [merge and sort]: collect the results and sort them by relevance (weight) in descending order
            //...

            //4. [build]: use jsoncpp to build a json string from the results
            Json::Value root;
            for(ns_index::InvertedElem &item : inverted_list_all)
            {
                ns_index::DocInfo *doc=index->GetForwardIndex(item.doc_id);
                if(doc==nullptr)
                    continue;

                Json::Value elem;
                elem["title"]=doc->title;
                elem["desc"]=GetDesc(doc->content,item.word); //fill in a summary instead of the whole content
                elem["url"]=doc->url;

                root.append(elem);
            }
            Json::StyledWriter writer;
            *json_string=writer.write(root);
        }

        //get the page summary
        std::string GetDesc(const std::string &html_content,const std::string &word)
        {
            //find the first occurrence of word in html_content,
            //then take 50 bytes before it (or start at the beginning if there are fewer)
            //and 100 bytes after it (or stop at the end if there are fewer),
            //and return that slice as the summary

            //1. find the first occurrence (case-insensitive)
            //   the bytes are compared as unsigned char, because std::tolower is undefined
            //   for the negative values a plain char can hold for non-ASCII bytes
            auto iter=std::search(html_content.begin(),html_content.end(),word.begin(),word.end(),
                [](unsigned char x,unsigned char y){
                    return (std::tolower(x)==std::tolower(y));
            });
            if(iter==html_content.end())//the keyword is not in the content, e.g. because it occurs only in the title
                return "None";
            int pos=std::distance(html_content.begin(),iter);

            //2. compute start and end
            const int prev_step =50;
            const int next_step=100;
            int start=0;                  //defaults to the beginning of the document
            int end=html_content.size()-1;//defaults to the end of the document
            if(pos > start+prev_step)
                start=pos-prev_step; //move start forward
            if(pos < end-next_step)
                end=pos+next_step;   //move end backward

            //3. cut out the substring
            if(start >= end) //normally impossible unless the content is nearly empty or there is a bug in the code
                return "None";
            std::string desc=html_content.substr(start,end-start);
            desc+="...";
            return desc;
        }
    };
}
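
The window arithmetic in GetDesc() is where off-by-one mistakes tend to creep in, so here is a compact sketch of the same idea written with unsigned positions and std::min. Snippet is a hypothetical free function for illustration; unlike GetDesc() it returns an empty string when the word is not found and does not append "...".

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Return up to `before` bytes preceding and `after` bytes following the start
// of the first case-insensitive match of `word` in `text`; empty if not found.
std::string Snippet(const std::string& text, const std::string& word,
                    std::size_t before = 50, std::size_t after = 100) {
    auto it = std::search(text.begin(), text.end(), word.begin(), word.end(),
        [](unsigned char a, unsigned char b) {
            return std::tolower(a) == std::tolower(b);
        });
    if (it == text.end()) return "";
    std::size_t pos = static_cast<std::size_t>(it - text.begin());
    std::size_t start = pos > before ? pos - before : 0;    // clamp at the beginning
    std::size_t end = std::min(text.size(), pos + after);   // clamp at the end
    return text.substr(start, end - start);
}
```

Both this sketch and GetDesc() cut by bytes, so with UTF-8 text a multi-byte character at either edge of the window can be split in half. For the mostly English Boost documentation this is rarely visible, but it is worth knowing if the same code is applied to other content.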

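After recompiling and running the program, searching for split shows results that are far cleaner and more concise.

VI. Writing the http_server module

Create an http_server.cc source file in the project directory Boost_Seacher; this is where we write the network server module.

1) Installing and using cpp-httplib

【ps】Using cpp-httplib requires upgrading the gcc 4.8.5 compiler that centOS7 ships by default:

sudo yum install centos-release-scl scl-utils-build (installs scl)
sudo yum install -y devtoolset-9-gcc devtoolset-9-gcc-c++ (installs the newer gcc)

【ps】To install a newer gcc and g++ on ubuntu: sudo apt install -y gcc-9 g++-9

Installing cpp-httplib:

Download page for cpp-httplib: cpp-httplib: cpp-httplib

On the page that opens, click the tags, select version 0.7.15, and download the zip file.

Enter the Boost_Searcher/test/ directory created earlier and drag the downloaded archive into the xshell terminal.

Extract the archive with unzip.

Among the extracted files, the httplib.h header is the one we actually need.

Back in the Boost_Searcher directory, create a symbolic link named cpp-httplib to test/cpp-httplib-v0.7.15/.

We can now use cpp-httplib in the project.

Using cpp-httplib:

We write some code in the http_server.cc source file created earlier to demonstrate cpp-httplib.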

http_server.cc

//#include"Searcher.hpp"
#include"cpp-httplib/httplib.h"

int main()
{
    //create a Server object, which is in effect the server itself
    httplib::Server svr;
    //build the http response to a GET request for /hi
    svr.Get("/hi",[](const httplib::Request &req,httplib::Response &rsp){
        rsp.set_content("hello world!","text/plain;charset=utf-8");
    });
    //bind port 8888 and start listening (0.0.0.0 means listening on all local network interfaces)
    svr.listen("0.0.0.0",8888);
    return 0;
}

Makefile

PARSER=parser
DUG=debug
HTTP_SERVER=http_server
CC=g++-9

.PHONY:all
all:$(PARSER) $(DUG) $(HTTP_SERVER)

$(PARSER):parser.cc
	$(CC) -o $@ $^ -std=c++11 -lboost_system -lboost_filesystem
$(DUG):debug.cc
	$(CC) -o $@ $^ -std=c++11 -ljsoncpp
$(HTTP_SERVER):http_server.cc
	$(CC) -o $@ $^ -std=c++11 -ljsoncpp -lpthread
.PHONY:clean
clean:
	rm -rf $(PARSER) $(DUG) $(HTTP_SERVER)

Compile and run the http_server executable, then open a browser and go to “cloud-server-IP:8888/hi”; the browser shows the http response we built in the code.

However, visiting “cloud-server-IP:8888” directly finds no page.

Normally a website returns a home page when visited, as www.baidu.com does. Likewise, visiting “cloud-server-IP:8888” ought to show a home page.

So create a wwwroot directory under Boost_Searcher; it serves as the web root and holds the home page.

Create an index.html file in the wwwroot directory.

index.html contains the home page for “cloud-server-IP:8888”.

index.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Boost Search Engine</title>
</head>
<body>
    <h1>Welcome to my world</h1>
</body>
</html>

Modify http_server.cc and recompile.

http_server.cc

//#include"Searcher.hpp"
#include"cpp-httplib/httplib.h"

const std::string root_path="./wwwroot";//path of the web root

int main()
{
    //create a Server object, which is in effect the server itself
    httplib::Server svr;

    //serve the home page from the web root
    svr.set_base_dir(root_path.c_str());

    //build the http response to a GET request for /hi
    svr.Get("/hi",[](const httplib::Request &req,httplib::Response &rsp){
        rsp.set_content("hello world!","text/plain;charset=utf-8");
    });
    //bind port 8888 and start listening (0.0.0.0 means listening on all local network interfaces)
    svr.listen("0.0.0.0",8888);
    return 0;
}

After starting the program, visit “cloud-server-IP:8888” in a browser to see the home page from index.html.

2) Completing the http handling

With cpp-httplib brought in and demonstrated above, we now complete the network server module.

http_server.cc

#include"Searcher.hpp"
#include"cpp-httplib/httplib.h"
const std::string input="data/raw_html/raw.txt";//索引数据源
const std::string root_path="./wwwroot";//web根目录的路径

int main()
{
    //初始化索引
    ns_searcher::Searcher search;
    search.InitSearcher(input);

    //创建一个Server对象，本质就是搭建服务端
    httplib::Server svr;
    
    //访问首页
    svr.set_base_dir(root_path.c_str());

    //根据get方法获取的http请求，构建http响应
    svr.Get("/s",[&search](const httplib::Request &req,httplib::Response &rsp){
        if(!req.has_param("word"))//检测用户的请求中是否有搜索关键字
        {
            rsp.set_content("必须要有搜索关键字！","text/plain;charset=utf-8");
            return;
        }
        //获取用户输入的关键字
        std::string word=req.get_param_value("word");
        std::cout<<"用户在搜索："<< word <<std::endl; //for debug
        //根据关键字，构建json串
        std::string json_string;
        search.Search(word,&json_string);
        //设置 get "s" 请求返回的内容，返回的是根据关键字，构建json串内容
        rsp.set_content(json_string,"application/json");
    });

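    // Bind port 8888 and start listening (0.0.0.0 means listening on every local network interface, i.e. any local IP address, not "any port")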
    svr.listen("0.0.0.0",8888);
    return 0;
}

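After compiling and running, enter “server-IP:8888/s” in the browser followed by the search keyword, for example split, and the browser shows the JSON string returned by the server.

The full URL is “server-IP:8888/s?word=split”.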
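
The part after the `?` is the query string, and `word=split` is one key/value pair in it. cpp-httplib parses it for us (that is what `has_param` and `get_param_value` rely on), but the idea can be sketched in a few lines. The functions `UrlDecode` and `ParseQuery` below are hypothetical names for illustration and are not the library's actual code.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// Decode "%XX" escapes and '+' (meaning space) in one query component.
std::string UrlDecode(const std::string &in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()
            && std::isxdigit(static_cast<unsigned char>(in[i + 1]))
            && std::isxdigit(static_cast<unsigned char>(in[i + 2]))) {
            // Two hex digits follow: turn them into a single byte.
            out += static_cast<char>(std::stoi(in.substr(i + 1, 2), nullptr, 16));
            i += 2;
        } else if (in[i] == '+') {
            out += ' ';
        } else {
            out += in[i];
        }
    }
    return out;
}

// Split "k1=v1&k2=v2" into a key -> value map.
std::map<std::string, std::string> ParseQuery(const std::string &query)
{
    std::map<std::string, std::string> params;
    std::size_t begin = 0;
    while (begin <= query.size()) {
        std::size_t end = query.find('&', begin);
        if (end == std::string::npos)
            end = query.size();
        const std::string pair = query.substr(begin, end - begin);
        if (!pair.empty()) {
            const std::size_t eq = pair.find('=');
            const std::string key = UrlDecode(pair.substr(0, eq));
            const std::string value =
                (eq == std::string::npos) ? "" : UrlDecode(pair.substr(eq + 1));
            params[key] = value;
        }
        begin = end + 1;
    }
    return params;
}
```

For the URL above, `ParseQuery("word=split")` yields a map whose "word" entry is "split"; a request without that entry is what the `has_param("word")` check in our handler rejects.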

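If the request carries no search keyword, the browser shows the plain-text prompt set in our code (必须要有搜索关键字！, meaning "a search keyword is required").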

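VII. Writing the Front-End Module

To have the browser show a reasonably polished, formal-looking page, we need a front-end module, in which:

html: the skeleton of the page, responsible for its structure
css: the skin of the page, responsible for its appearance
js (JavaScript): the soul of the page, responsible for dynamic effects and front-end/back-end interaction

[ps] For more front-end tutorials and documentation, see: www.w3school.com.cn/

Below we implement the page's structure, appearance, dynamic effects, and front-end/back-end interaction one by one.

1) Page structure: html

Since this is a site search engine, the page naturally needs a search box and a search button, and each result is shown as a page title, a summary, and a URL. The page structure is therefore roughly as shown in the figure below:

We write the page structure into the index.html file under Boost_Searcher/wwwroot.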

index.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Boost 搜索引擎</title>
</head>
<body>
    <div class="container">
        <div class="search">
            <input type="text" placeholder="请输入搜索关键字...">
            <button>搜索一下</button>
        </div>
        <div class="result">
            <div class="item">
                <a href="#">这是标题</a>
                <p>这是摘要xxxxx</p>
                <i>这是url</i>
            </div>
            <div class="item">
                <a href="#">这是标题</a>
                <p>这是摘要xxxxx</p>
                <i>这是url</i>
            </div>
            <div class="item">
                <a href="#">这是标题</a>
                <p>这是摘要xxxxx</p>
                <i>这是url</i>
            </div>
        </div>
    </div>
</body>
</html>

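Compile and run the project, visit “server-IP:8888” in a browser, and you will see the following page:

Such a page is clearly too crude, so we still need css to improve its appearance.

2) Page appearance: css

We continue from the page structure above and style it with css.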

index.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Boost 搜索引擎</title>
    <style>
        /*去掉所有默认的内外边距*/
        *{
            margin: 0; /*外边距*/
            padding: 0;/*内边距*/
        }
        /*将body中的内容与html的呈现100%吻合*/
        html,
        body{
            height: 100%;/*高度*/
        }
        /*以.开头的类选择器，可以编辑相应类的外观*/
        .container{
            width: 800px;/*宽度*/
            margin: 0px auto;/*上下外边距为0，左右自动对齐 */
            margin-top: 15px;/*顶部外边距为15，保持元素和网页顶部的距离*/
        }
        /* 复合选择器，选择container下的search */
        .container .search{
            width: 100%;/*宽度与父标签保持一致*/
            height: 52px;/*高度设置为52像素点*/
        }
        /* 选择搜索框 input，直接设置标签的属性 （单独的input：标签选择器）*/
        .container .search input{
            float: left;/*左浮动，与搜索按钮拼接在一起*/
            width: 600px;
            height: 50px;
            border: 1px solid black;/*边框的属性，宽度、样式、颜色*/
            border-right: none;/*取消边框右边部分*/
            padding-left: 10px;/*左内边距*/
            color: #ccc;/*字体颜色*/
            font-size: 14px;/*字体大小*/
        }
        /* 选择搜索按钮 button （单独的button：标签选择器）*/
        .container .search button{
            float: left;/* float left so the button sits flush against the search box */
            width: 150px;
            height: 52px;/*与搜索框对齐*/
            background-color: #4e6ef2;/*搜索按钮的背景颜色*/
            color: #FFF;/*字体颜色*/
            font-size: 19px;/*字体大小*/
            font-family: Georgia, 'Times New Roman', Times, serif;/*字体风格*/
        }
        /*选择 result*/
        .container .result{
            width: 100%;
        }
        /*选择 item*/
        .container .result .item{
            margin-top: 15px;
        }
        /*选择 item下的a、p、i标签*/
        .container .result .item a{
            display: block;/*设置为块级元素，单独占一行*/
            text-decoration: none;/*去掉下划线*/
            font-size: 20px;/*设置标题的字体大小*/
            color: #4e6ef2;/*字体颜色*/
        }
        .container .result .item p{
            font-size: 16px;
            font-family: 'Lucida Sans', 'Lucida Sans Regular', 'Lucida Grande', 'Lucida Sans Unicode', Geneva, Verdana, sans-serif;
            margin-top: 5px;
        }
        .container .result .item i{
            display: block;/*设置为块级元素，单独占一行*/
            font-style: normal;/*取消斜体*/
            color: green;
        }
    </style>
</head>
<body>
    <div class="container">
        <div class="search">
            <input type="text" placeholder="请输入搜索关键字...">
            <button>搜索一下</button>
        </div>
        <div class="result">
            <div class="item">
                <a href="#">这是标题</a>
                <p>这是摘要xxxxx</p>
                <i>这是url</i>
            </div>
            <div class="item">
                <a href="#">这是标题</a>
                <p>这是摘要xxxxx</p>
                <i>这是url</i>
            </div>
            <div class="item">
                <a href="#">这是标题</a>
                <p>这是摘要xxxxx</p>
                <i>这是url</i>
            </div>
        </div>
    </div>
</body>
</html>

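Compile and run the project, visit “server-IP:8888” in a browser, and you will see the following page:

3) Front-end/back-end interaction: JS

So that typing a keyword into the search box and clicking the search button actually performs a search, we also need JavaScript to connect the front end to the back end.

Native JS is relatively costly to use, so we recommend using jQuery directly, which is much like bringing in a third-party library.

[ps] For details on using jQuery, see: www.jq22.com/cdn/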

index.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <!-- /* 引入JQuery库 */ -->
    <script src="https://code.jquery.com/jquery-2.1.1.min.js"></script>
    <title>Boost 搜索引擎</title>
    <style>
        /*去掉所有默认的内外边距*/
        *{
            margin: 0; /*外边距*/
            padding: 0;/*内边距*/
        }
        /*将body中的内容与html的呈现100%吻合*/
        html,
        body{
            height: 100%;/*高度*/
        }
        /*以.开头的类选择器，可以编辑相应类的外观*/
        .container{
            width: 800px;/*宽度*/
            margin: 0px auto;/*上下外边距为0，左右自动对齐 */
            margin-top: 15px;/*顶部外边距为15，保持元素和网页顶部的距离*/
        }
        /* 复合选择器，选择container下的search */
        .container .search{
            width: 100%;/*宽度与父标签保持一致*/
            height: 52px;/*高度设置为52像素点*/
        }
        /* 选择搜索框 input，直接设置标签的属性 （单独的input：标签选择器）*/
        .container .search input{
            float: left;/*左浮动，与搜索按钮拼接在一起*/
            width: 600px;
            height: 50px;
            border: 1px solid black;/*边框的属性，宽度、样式、颜色*/
            border-right: none;/*取消边框右边部分*/
            padding-left: 10px;/*左内边距*/
            color: #ccc;/*字体颜色*/
            font-size: 14px;/*字体大小*/
        }
        /* 选择搜索按钮 button （单独的button：标签选择器）*/
        .container .search button{
            float: left;/* float left so the button sits flush against the search box */
            width: 150px;
            height: 52px;/*与搜索框对齐*/
            background-color: #4e6ef2;/*搜索按钮的背景颜色*/
            color: #FFF;/*字体颜色*/
            font-size: 19px;/*字体大小*/
            font-family: Georgia, 'Times New Roman', Times, serif;/*字体风格*/
        }
        /*选择 result*/
        .container .result{
            width: 100%;
        }
        /*选择 item*/
        .container .result .item{
            margin-top: 15px;
        }
        /*选择 item下的a、p、i标签*/
        .container .result .item a{
            display: block;/*设置为块级元素，单独占一行*/
            text-decoration: none;/*去掉下划线*/
            font-size: 20px;/*设置标题的字体大小*/
            color: #4e6ef2;/*字体颜色*/
        }
        .container .result .item p{
            font-size: 16px;
            font-family: 'Lucida Sans', 'Lucida Sans Regular', 'Lucida Grande', 'Lucida Sans Unicode', Geneva, Verdana, sans-serif;
            margin-top: 5px;
        }
        .container .result .item i{
            display: block;/*设置为块级元素，单独占一行*/
            font-style: normal;/*取消斜体*/
            color: green;
        }
    </style>
</head>
<body>
    <div class="container">
        <div class="search">
            <input type="text" placeholder="请输入搜索关键字...">
            <button onclick="Search()">搜索一下</button>
        </div>
        <div class="result">
            <!-- 动态生成网页内容 -->
        </div>
    </div>
    <script>
        function Search(){
            //alert("hello js");//for debug，是浏览器的一个弹出框
            
            //1.提取数据
            let query=$(".container .search input").val();// $ is an alias for the jQuery function
            console.log("query = "+query); // console is the browser's developer console, handy for inspecting JS data

            //2.发起http请求
            //ajax:属于JQuery中一个可以进行前后端交互的函数
            $.ajax({
                type:"GET",
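                url:"/s?word=" + encodeURIComponent(query), // percent-encode the keyword so spaces, & or non-ASCII text survive in the URL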
                success:function(data){
                    console.log(data);
                    BuildHtml(data);
                }
            });
        }
        function BuildHtml(data){
            let result_lable = $(".container .result");//获取html中的result标签
            //清空历史搜索结果
            result_lable.empty();
            //遍历data，构建搜索结果，形成动态网页
            for(let elem of (data || [])){ // guard against a null response when nothing matched
                //console.log(elem.title); //for debug
                //console.log(elem.content);//for debug
                
                //填充a、p、i、div标签
                let a_lable = $("<a>",{
                    text:elem.title,
                    href:elem.url,
                    target:"_blank" //跳转到新的页面
                });
                let p_lable = $("<p>",{
                    text:elem.desc
                });
                let i_lable = $("<i>",{
                    text:elem.url
                });
                let div_lable = $("<div>",{
                    class:"item"
                });
                //添加a、p、i标签到div标签下
                a_lable.appendTo(div_lable);
                p_lable.appendTo(div_lable);
                i_lable.appendTo(div_lable);
                //添加div标签到result标签下
                div_lable.appendTo(result_lable);
            }

        }
    </script>
</body>
</html>

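Compile and run the project, visit “server-IP:8888” in a browser, type “split” into the search box, and click the search button. You will see the following page:

Click any blue title to jump to the corresponding page on the Boost website: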
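
One detail of the request built in Search() deserves attention: the keyword becomes part of the URL, so characters such as spaces, `&`, `#`, or Chinese text should be percent-encoded first (in the browser, `encodeURIComponent` does this). The sketch below shows the same idea in C++, the project's main language. `UrlEncode` is a hypothetical helper, and it follows the RFC 3986 "unreserved" character set, which is slightly stricter than what `encodeURIComponent` leaves untouched.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: percent-encode one query-string component.
// Letters, digits and "-_.~" are kept; every other byte becomes %XX.
std::string UrlEncode(const std::string &in)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : in) {
        if (std::isalnum(c) || c == '-' || c == '_' || c == '.' || c == '~') {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];     // high nibble
            out += hex[c & 0x0F];   // low nibble
        }
    }
    return out;
}
```

Without this step, a query such as "a b&c" would be cut off at the `&` and the server would only see part of the keyword.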

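4) Further front-end improvements

This version adds a clear button for the search box, a loading indicator, paginated results, an error message area, and a list of previous searches kept in localStorage.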

index.html

<!DOCTYPE html>
<html lang="en">
 
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <script src="https://code.jquery.com/jquery-3.6.0.min.js"></script>
    <title>Boost 库搜索引擎</title>
    <style>
        * {
            margin: 0;
            padding: 0;
            box-sizing: border-box;
        }
 
        html,
        body {
            height: 100%;
            background: url('https://images.unsplash.com/photo-1517430816045-df4b7de6d0e6') no-repeat center center fixed;
            background-size: cover;
            font-family: Arial, sans-serif;
        }
 
        .container {
            width: 90%;
            max-width: 1200px;
            margin: 50px auto;
            padding: 20px;
            background-color: rgba(255, 255, 255, 0.9);
            box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);
            border-radius: 8px;
        }
 
        h1 {
            margin-bottom: 20px;
            font-size: 36px;
            color: #4e6ef2;
            text-align: center;
        }
 
        .search {
            display: flex;
            justify-content: center;
            position: relative;
        }
 
        .search input {
            flex: 1;
            height: 50px;
            border: 2px solid #ccc;
            padding-left: 10px;
            font-size: 17px;
            border-radius: 25px 0 0 25px;
            transition: border-color 0.3s;
        }
 
        .search input:focus {
            border-color: #4e6ef2;
            outline: none;
        }
 
        .search button {
            width: 160px;
            height: 50px;
            background-color: #4e6ef2;
            color: #fff;
            font-size: 19px;
            cursor: pointer;
            transition: background-color 0.3s;
            border: none;
            border-radius: 0 25px 25px 0;
        }
 
        .search button:hover {
            background-color: #3b5ec2;
        }
 
        .clear-btn {
            position: absolute;
            right: 170px;
            top: 50%;
            transform: translateY(-50%);
            cursor: pointer;
            font-size: 18px;
            display: none;
            color: #ccc;
        }
 
        .result {
            width: 100%;
            margin-top: 20px;
        }
 
        .result .item {
            margin-top: 15px;
            padding: 15px;
            background-color: #fff;
            border: 1px solid #ddd;
            border-radius: 5px;
            transition: box-shadow 0.3s;
            text-align: left;
        }
 
        .result .item:hover {
            box-shadow: 0 2px 8px rgba(0, 0, 0, 0.1);
        }
 
        .result .item a {
            display: block;
            text-decoration: none;
            font-size: 22px;
            color: #4e6ef2;
            margin-bottom: 5px;
        }
 
        .result .item a:hover {
            text-decoration: underline;
        }
 
        .result .item p {
            font-size: 16px;
            color: #333;
            margin-bottom: 5px;
        }
 
        .result .item i {
            display: block;
            font-style: normal;
            color: green;
        }
 
        .loader {
            border: 4px solid #f3f3f3;
            border-top: 4px solid #4e6ef2;
            border-radius: 50%;
            width: 40px;
            height: 40px;
            animation: spin 2s linear infinite;
            display: none;
            margin: 20px auto;
        }
 
        @keyframes spin {
            0% {
                transform: rotate(0deg);
            }
 
            100% {
                transform: rotate(360deg);
            }
        }
 
        .error-message {
            color: red;
            text-align: center;
            margin-top: 20px;
        }
 
        .pagination {
            margin-top: 20px;
            display: flex;
            justify-content: center;
        }
 
        .pagination button {
            background-color: #4e6ef2;
            color: #fff;
            border: none;
            border-radius: 5px;
            padding: 10px 15px;
            margin: 0 5px;
            cursor: pointer;
            transition: background-color 0.3s;
        }
 
        .pagination button:hover {
            background-color: #3b5ec2;
        }
 
        .pagination button:disabled {
            background-color: #ccc;
            cursor: not-allowed;
        }
 
        .previous-searches {
            margin-top: 20px;
        }
 
        .previous-searches h2 {
            font-size: 20px;
            color: #4e6ef2;
            text-align: center;
            margin-bottom: 10px;
        }
 
        .previous-searches ul {
            list-style-type: none;
            text-align: center;
        }
 
        .previous-searches ul li {
            display: inline-block;
            margin: 5px 10px;
            padding: 5px 10px;
            background-color: #4e6ef2;
            color: #fff;
            border-radius: 5px;
            cursor: pointer;
            transition: background-color 0.3s;
        }
 
        .previous-searches ul li:hover {
            background-color: #3b5ec2;
        }
    </style>
</head>
 
<body>
    <div class="container">
        <h1>Boost库搜索引擎</h1>
        <div class="search">
            <input type="text" placeholder="输入搜索关键字..." id="searchInput">
            <span class="clear-btn" id="clearBtn">&times;</span>
            <button onclick="Search()">搜索一下</button>
        </div>
        <div class="loader" id="loader"></div>
        <div class="result" id="resultContainer"></div>
        <div class="pagination" id="paginationContainer"></div>
        <div class="error-message" id="errorMessage"></div>
        <div class="previous-searches" id="previousSearches">
            <h2>之前的搜索</h2>
            <ul id="previousSearchList"></ul>
        </div>
    </div>
    <script>
        let currentPage = 1;
        const resultsPerPage = 8;
        let allResults = [];
        let previousSearches = JSON.parse(localStorage.getItem('previousSearches')) || [];
 
        $(document).ready(function () {
            $("#searchInput").on("input", function () {
                if ($(this).val()) {
                    $("#clearBtn").show();
                } else {
                    $("#clearBtn").hide();
                }
            });
 
            $("#clearBtn").on("click", function () {
                $("#searchInput").val('');
                $(this).hide();
            });
 
            displayPreviousSearches();
        });
 
        function Search() {
            const query = $("#searchInput").val().trim();
            if (!query) {
                alert("请输入搜索关键字！");
                return;
            }
 
            if (!previousSearches.includes(query)) {
                if (previousSearches.length >= 5) {
                    previousSearches.shift();
                }
                previousSearches.push(query);
                localStorage.setItem('previousSearches', JSON.stringify(previousSearches));
            }
 
            $("#loader").show();
            $("#errorMessage").text('');
            $.ajax({
                type: "GET",
                url: "/s?word=" + encodeURIComponent(query), // percent-encode the keyword before putting it in the URL
                success: function (data) {
                    $("#loader").hide();
                    allResults = Array.isArray(data) ? data : []; // the server may answer null when nothing matched
                    currentPage = 1;
                    displayResults();
                },
                error: function () {
                    $("#loader").hide();
                    $("#errorMessage").text('搜索失败，请稍后重试。');
                }
            });
        }
 
        function displayResults() {
            const resultContainer = $("#resultContainer");
            const paginationContainer = $("#paginationContainer");
            resultContainer.empty();
            paginationContainer.empty();
 
            const totalResults = allResults.length;
            const totalPages = Math.ceil(totalResults / resultsPerPage);
 
            if (totalResults === 0) {
                $("#errorMessage").text('没有搜索到相关的内容。');
                return;
            }
 
            const start = (currentPage - 1) * resultsPerPage;
            const end = Math.min(start + resultsPerPage, totalResults);
            const currentResults = allResults.slice(start, end);
 
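            currentResults.forEach(elem => {
                // Build the nodes with text() so that titles and summaries containing
                // characters such as < or & are shown literally instead of being parsed as HTML
                const item = $('<div class="item">');
                $('<a target="_blank">').attr('href', elem.url).text(elem.title).appendTo(item);
                $('<p>').text(elem.desc).appendTo(item);
                $('<i>').text(elem.url).appendTo(item);
                resultContainer.append(item);
            });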
 
            displayPagination(totalPages);
            displayPreviousSearches();
        }
 
        function displayPagination(totalPages) {
            const paginationContainer = $("#paginationContainer");
 
            if (currentPage > 1) {
                const prevButton = $('<button>上一页</button>');
                prevButton.on('click', function () {
                    currentPage--;
                    displayResults();
                });
                paginationContainer.append(prevButton);
            }
 
            let startPage, endPage;
            if (totalPages <= 5) {
                startPage = 1;
                endPage = totalPages;
            } else {
                if (currentPage <= 3) {
                    startPage = 1;
                    endPage = 5;
                } else if (currentPage + 2 >= totalPages) {
                    startPage = totalPages - 4;
                    endPage = totalPages;
                } else {
                    startPage = currentPage - 2;
                    endPage = currentPage + 2;
                }
            }
 
            for (let i = startPage; i <= endPage; i++) {
                const button = $(`<button>${i}</button>`);
                if (i === currentPage) {
                    button.prop('disabled', true);
                }
                button.on('click', function () {
                    currentPage = i;
                    displayResults();
                });
                paginationContainer.append(button);
            }
 
            if (currentPage < totalPages) {
                const nextButton = $('<button>下一页</button>');
                nextButton.on('click', function () {
                    currentPage++;
                    displayResults();
                });
                paginationContainer.append(nextButton);
            }
        }
 
        function displayPreviousSearches() {
            const previousSearchList = $("#previousSearchList");
            previousSearchList.empty();
 
            previousSearches.forEach(search => {
                const item = $('<li>').text(search); // text() keeps user input from being parsed as HTML
                item.on('click', function () {
                    $("#searchInput").val(search);
                    Search();
                });
                previousSearchList.append(item);
            });
        }
    </script>
</body>
 
</html>

Compile and run the project, visit “server-IP:8888” in a browser, and you will see the following page:
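
The pagination in displayResults() and displayPagination() boils down to three small calculations: how many pages there are, which slice of the results belongs to the current page, and which window of at most five page buttons to show. The sketch below restates them in C++ so each can be checked in isolation. The function names are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Number of pages needed: ceil(total_results / per_page) in integer arithmetic.
int TotalPages(int total_results, int per_page)
{
    return (total_results + per_page - 1) / per_page;
}

// Half-open index range [start, end) of the results shown on page `current`
// (pages are numbered from 1).
std::pair<int, int> PageSlice(int current, int per_page, int total_results)
{
    const int start = (current - 1) * per_page;
    const int end = std::min(start + per_page, total_results);
    return {start, end};
}

// Inclusive range [first, last] of page buttons to display: at most five,
// centered on the current page unless it is close to either end.
std::pair<int, int> PageWindow(int current, int total_pages)
{
    if (total_pages <= 5)
        return {1, total_pages};
    if (current <= 3)
        return {1, 5};
    if (current + 2 >= total_pages)
        return {total_pages - 4, total_pages};
    return {current - 2, current + 2};
}
```

For example, 17 results at 8 per page give 3 pages, and the last page holds only the result at index 16. Because the server returns every result in one response, all of this paging happens in the browser without further requests.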

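VIII. Polishing the Remaining Details

1) Fixing duplicate documents in the search results

At search time, cppjieba splits the user's query into several words, and each word is looked up in the inverted index separately. When several of those words occur in the same document, that document's ID appears in several inverted lists, so the same page may show up multiple times in the results. We therefore need to deduplicate the search results by document ID.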
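
Before the full listing, here is the deduplication idea on its own, as a minimal sketch: group the postings by document ID in a hash map, add up the weights, remember the matching words, then sort by total weight. The types `Posting` and `MergedHit` and the function `MergeByDoc` are simplified stand-ins invented for this example; the project's own types appear in Searcher.hpp below.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Simplified stand-in for one entry of an inverted list.
struct Posting {
    std::uint64_t doc_id;
    std::string word;
    int weight;
};

// One search hit after merging all postings of the same document.
struct MergedHit {
    std::uint64_t doc_id = 0;
    int weight = 0;
    std::vector<std::string> words;
};

std::vector<MergedHit> MergeByDoc(const std::vector<Posting> &postings)
{
    std::unordered_map<std::uint64_t, MergedHit> by_doc;
    for (const auto &p : postings) {
        MergedHit &hit = by_doc[p.doc_id]; // created on first use
        hit.doc_id = p.doc_id;
        hit.weight += p.weight;            // accumulate relevance
        hit.words.push_back(p.word);
    }

    std::vector<MergedHit> merged;
    merged.reserve(by_doc.size());
    for (auto &kv : by_doc)                // non-const, so the move really moves
        merged.push_back(std::move(kv.second));

    // Highest total weight first.
    std::sort(merged.begin(), merged.end(),
              [](const MergedHit &a, const MergedHit &b) { return a.weight > b.weight; });
    return merged;
}
```

A side effect of summing the weights is that a document matching several query words ranks above one that matches only a single word with the same per-word weight.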

Searcher.hpp

#pragma once

#include"index.hpp"
#include"util.hpp"
#include<algorithm>
#include<cctype>   //std::tolower
#include<jsoncpp/json/json.h>
#include<unordered_map>
namespace ns_searcher
{
    struct InvertedElemPrint{//去重
        uint64_t doc_id;
        int weight;
        std::vector<std::string> words;
        InvertedElemPrint():doc_id(0),weight(0){}
    };

    class Searcher
    {
    private:
        ns_index::Index *index;//供系统进行查找的索引
    public:
        Searcher():index(nullptr){}
        ~Searcher(){}
    public:
        //初始化Searcher模块
        void InitSearcher(const std::string &input)
        {
            //1.获取或创建index索引对象
            index = ns_index::Index::GetInstance();
            std::cout<<"单例获取成功..."<<std::endl; //for debug
            //2.根据index对象建立索引
            index->BuildIndex(input);
            std::cout<<"索引构建成功..."<<std::endl; //for debug
        }

        //进行关键字的搜索
        //  query:搜索关键字
        //  json_string:返回给用户浏览器的搜索结果
        void Search(const std::string &query,std::string *json_string)
        {
            //1.【分词】：对query按照searcher的要求进行分词
            std::vector<std::string> words;
            ns_util::JiebaUtil::CutString(query,&words);
            //2.【触发】：根据分词后的各个词，进行index查找
            // ns_index::InvertedList_t inverted_list_all; //保存所有倒排索引的结果
            std::vector<InvertedElemPrint> inverted_list_all;          //保存去重后的数据
            std::unordered_map<uint64_t,InvertedElemPrint> tokens_map; //去重

            //遍历分词后的每个词
            for(std::string word:words)
            {
                boost::to_lower(word);//与构建索引时一样，搜索时也要统一为小写
                //先进行倒排索引
                ns_index::InvertedList_t *inverted_list=index->GetInvertedList(word);
                if(inverted_list==nullptr)
                    continue; //就算本次倒排没找到，也继续找
                //inverted_list_all.insert(inverted_list_all.end(),inverted_list->begin(),inverted_list->end()); //找到了，将结果批量化地插入inverted_list_all
                
                //对文档ID相同的关键字进行去重，并将其保存至tokens_map中
                for(const auto &elem:*inverted_list)
                {
                    auto &item=tokens_map[elem.doc_id];//存在就直接获取；不存在就新建
                    //由 operator[] 的特性，此时item一定是doc_id相同的InvertedElemPrint节点
                    item.doc_id=elem.doc_id;         //赋值doc_id
                    item.weight+=elem.weight;        //将doc_id相同的关键字的权重累加起来
                    item.words.push_back(elem.word); //保存doc_id相同的关键字
                }
            }
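            //Move every merged entry out of tokens_map into the result list (no duplicate documents remain)
            for(auto &item:tokens_map) //non-const reference; with const auto& the std::move below would silently copy
            {
                inverted_list_all.push_back(std::move(item.second));
            }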

            //3.【合并排序】：汇总搜索结果，按照相关性（权重weight）进行降序排序
            // std::sort(inverted_list_all.begin(),inverted_list_all.end(),
            //     [](const ns_index::InvertedElem &e1,const ns_index::InvertedElem &e2){
            //         return e1.weight>e2.weight;
            //     });
            std::sort(inverted_list_all.begin(),inverted_list_all.end(),
                [](const InvertedElemPrint &e1,const InvertedElemPrint &e2){
                    return e1.weight>e2.weight;
                });


            //4.【构建】：借助jsoncpp，根据搜索结果，构建json串
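            Json::Value root(Json::arrayValue); //start from an empty array so that "no results" serializes as [] rather than null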
            //进行正排索引
            for(auto &item : inverted_list_all)
            {
                ns_index::DocInfo *doc=index->GetForwardIndex(item.doc_id);
                if(doc==nullptr)
                    continue; //就算本次正排没找到，也继续找
                
                Json::Value elem;
                elem["title"]=doc->title;
                elem["desc"]=GetDesc(doc->content,item.words[0]);
                elem["url"]=doc->url;

                //for debug
                //elem["id"]=(int)item.doc_id;
                //elem["weight"]=item.weight;

                root.append(elem);
            }
            //Json::StyledWriter writer;
            Json::FastWriter writer;
            *json_string=writer.write(root);
        }

        //获取网页摘要
        std::string GetDesc(const std::string &html_content,const std::string &word)
        {
            //找到word在html_content中首次出现的位置
            //然后往前找50字节（不足就从头开始），往后找100字节（不足就到结尾）
            //并截取出这部分内容，作为摘要返回

            //1.找到首次出现的位置
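            auto iter=std::search(html_content.begin(),html_content.end(),word.begin(),word.end(),
                [](unsigned char x,unsigned char y){ //unsigned char: passing a negative char (e.g. a UTF-8 byte) to std::tolower is undefined behavior
                    return (std::tolower(x)==std::tolower(y));
            });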
            if(iter==html_content.end())//这种情况一般是不存在的,除非文档内容里真的没有关键字，那也说明代码有bug
                return "None";
            int pos=std::distance(html_content.begin(),iter);

            //2.获取start、end
            const int prev_step =50;
            const int next_step=100;
            int start=0;                  //默认在文档开头
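            int end=html_content.size();  //defaults to the end of the document (exclusive, matching substr below, so the last character is not dropped)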
            if(pos > start+prev_step)
                start=pos-prev_step; //更新位置
            if(pos < end-next_step)
                end=pos+next_step;   //更新位置

            //3.截取字符串
            if(start >= end) //这种情况一般也是不存在的，除非数据类型有问题、代码有bug
                return "None";
            std::string desc=html_content.substr(start,end-start);
            desc+="...";
            return desc;
        }
    };
}

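2) Removing stop words

cppjieba segments the search keywords, but it keeps the stop words in them, such as 的, 了, is, a, and the. These words occur very frequently in almost every document, so if they are kept they can distort the search results considerably. We therefore filter out stop words after segmentation to keep the results relevant.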
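
The filtering step by itself is small. The following sketch assumes the stop words are already loaded into a hash set and uses the erase-remove idiom, which makes a single pass over the vector. `RemoveStopWords` is a hypothetical name; the project's version, built around cppjieba, is in util.hpp below.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Remove every word that appears in stop_words, keeping the order of the rest.
void RemoveStopWords(std::vector<std::string> *words,
                     const std::unordered_set<std::string> &stop_words)
{
    words->erase(
        std::remove_if(words->begin(), words->end(),
                       [&stop_words](const std::string &w) {
                           return stop_words.count(w) > 0;
                       }),
        words->end());
}
```

Given the words the, boost, is, a, split, library and the stop words the, is, a, only boost, split, library remain. The same filter has to be applied both when building the index and when segmenting the query, otherwise the two sides would disagree on which words exist.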

util.hpp

#pragma once

#include<iostream>
#include<vector>
#include<string>
#include<fstream>
#include<boost/algorithm/string.hpp>
#include"cppjieba/Jieba.hpp"
#include<unordered_map>
#include<mutex>
namespace ns_util
{
    class FileUtil
    {
        public:
          static bool ReadFile(const std::string &file_path, std::string *out)
            {
                std::ifstream in(file_path, std::ios::in);
                if(!in.is_open()){
                    std::cerr << "open file " << file_path << " error" << std::endl;
                    return false;
                }

                std::string line;
                while(std::getline(in, line)){ //getline returns a reference to the stream; used as a condition, the stream converts to bool and becomes false once reading fails or hits end of file
                    *out += line;
                }

                in.close();
                return true;
            }
    };

    class StringUtil
    {
    public:
        static void Split(const std::string &target,std::vector<std::string> *out,const std::string &sep)
        {
            //boost split
            boost::split(*out,target,boost::is_any_of(sep),boost::token_compress_on);
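            //1st argument: where the resulting tokens are stored
            //2nd argument: the string to split
            //3rd argument: the set of separator characters (one or several)
            //4th argument: optional; the default token_compress_off keeps the empty tokens produced by adjacent separators,
            //              while token_compress_on merges adjacent separators into one.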
        }
    };

    //词库路径：
    const char* const DICT_PATH = "./dict/jieba.dict.utf8";
    const char* const HMM_PATH = "./dict/hmm_model.utf8";
    const char* const USER_DICT_PATH = "./dict/user.dict.utf8";
    const char* const IDF_PATH = "./dict/idf.utf8";
    const char* const STOP_WORD_PATH = "./dict/stop_words.utf8";
    class JiebaUtil
    {
    private:    
        //static cppjieba::Jieba jieba; 
        //将成员变量，定义为静态的，使其不必在外部每次调用CutString()时，都重新初始化一次
        //但注意，静态成员变量需要在类外初始化

        cppjieba::Jieba jieba;
        std::unordered_map<std::string,bool> stop_words;
        //将JiebaUtil类设为单例
        JiebaUtil()
            :jieba(
            DICT_PATH, 
            HMM_PATH, 
            USER_DICT_PATH, 
            IDF_PATH, 
            STOP_WORD_PATH)
        {}
        JiebaUtil(const JiebaUtil&)=delete;
        static JiebaUtil *instance;
    public:
        static JiebaUtil *get_instance()
        {
            static std::mutex mtx;
            if(instance==nullptr){
                mtx.lock();
                if(instance==nullptr){
                    instance=new JiebaUtil();
                    instance->InitJiebaUtil();
                }
                mtx.unlock();
            }
            return instance;
        }
        void InitJiebaUtil()
        {
            std::ifstream in(STOP_WORD_PATH);
            if(!in.is_open()){
                std::cerr<<"load stop words file "<<STOP_WORD_PATH<<" error"<<std::endl;
                return ;
            }
            std::string line;
            while(std::getline(in,line)){
                stop_words.insert({line,true});
            }
            in.close();
        }
        void CutStringHelper(const std::string &src,std::vector<std::string> *out)
        {
            //分词
            jieba.CutForSearch(src,*out);
            //遍历分词结果，去掉暂停词
            for(auto iter=out->begin();iter!=out->end();)
            {
                auto it=stop_words.find(*iter);
                if(it!=stop_words.end()){ //说明当前的string 是一个需要去掉的暂停词
                    iter=out->erase(iter);
                }else{
                    iter++;
                }
            }
        }
        
    public:
        //1.类内的静态方法支持外部调用
        //2.要在自己内部调用静态成员的方法，自己也得是静态的
        static void CutString(const std::string &src,std::vector<std::string> *out)
        {
            //jieba.CutForSearch(src,*out);
            ns_util::JiebaUtil::get_instance()->CutStringHelper(src,out);
        }
    };
    // cppjieba::Jieba JiebaUtil::jieba(
    //     DICT_PATH, 
    //     HMM_PATH, 
    //     USER_DICT_PATH, 
    //     IDF_PATH, 
    //     STOP_WORD_PATH);
    
    JiebaUtil *JiebaUtil::instance=nullptr;
}

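3) Adding a logging module

[Note] What logs are for

Debugging and error tracing: record states and errors during execution so that problems can be located and fixed.
Runtime monitoring: track the program's running state, its execution flow, and important events.
Auditing and analysis: analyze log records to understand user behavior and system performance, for data mining and improvement.

First, create a log.hpp header in the Boost_Searcher directory and write the logging module there; then add the corresponding log calls to index.hpp, Searcher.hpp, and http_server.cc.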
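
As a warm-up, the sketch below shows the two mechanisms such a module relies on: the preprocessor's `#` operator, which turns the level name into a string, and the predefined `__FILE__` and `__LINE__` macros. It differs from the log.hpp that follows in two assumed ways: it writes to any `std::ostream` (so the output can be captured) and it prints a human-readable timestamp. `SKETCH_LOG` and `WriteLog` are hypothetical names used only here.

```cpp
#include <cassert>
#include <ctime>
#include <iostream>
#include <sstream>
#include <string>

// #LEVEL stringizes the macro argument, so SKETCH_LOG(out, WARNING, ...)
// passes the text "WARNING"; __FILE__ and __LINE__ identify the call site.
#define SKETCH_LOG(OUT, LEVEL, MESSAGE) WriteLog(OUT, #LEVEL, MESSAGE, __FILE__, __LINE__)

inline void WriteLog(std::ostream &out, const std::string &level,
                     const std::string &message, const std::string &file, int line)
{
    char when[32] = "unknown-time";
    const std::time_t now = std::time(nullptr);
    // std::localtime is not thread-safe; acceptable for a single-threaded sketch.
    if (const std::tm *parts = std::localtime(&now))
        std::strftime(when, sizeof(when), "%Y-%m-%d %H:%M:%S", parts);

    out << "[" << level << "][" << when << "][" << message << "]["
        << file << " : " << line << "]" << std::endl;
}
```

Calling `SKETCH_LOG(std::cout, WARNING, "index file missing")` prints one line that starts with `[WARNING]`, followed by the time, the message, and the file and line of the call.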

log.hpp

#pragma once    
#include <iostream>    
#include <string>    
#include <ctime>    
    
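//Log levels:
#define NORMAL 1   //normal information
#define WARNING 2  //warning
#define DEBUG 3    //debugging information
#define FATAL 4    //fatal error

#define LOG(LEVEL, MESSAGE) log(#LEVEL, MESSAGE, __FILE__, __LINE__) //function-like macro

//inline, so that including this header from several source files does not cause a multiple-definition link error
inline void log(const std::string &level, const std::string &message, const std::string &file, int line)
{
    std::cout << "[" << level << "]" << "[" << time(nullptr) << "]" << "[" << message << "]" << "[" << file << " : " << line << "]" << std::endl;
}
/*
Brief notes:
    The logging facility is a macro; LEVEL is the log level (one of the four above).
    #LEVEL is the stringizing operator: it turns the macro argument into a string literal, so LOG(NORMAL, ...) passes "NORMAL".
Predefined macros:
    __FILE__: name of the source file being compiled
    __LINE__: current line number in that file
*/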

index.hpp

#pragma once
#include<iostream>
#include<string>
#include<vector>
#include<unordered_map>
#include<fstream>
#include"util.hpp"
#include<mutex>//c++的互斥锁
#include"log.hpp"
namespace ns_index
{
    struct DocInfo //document information
    {
        std::string title;  //document title
        std::string content;//cleaned document content (tags removed)
        std::string url;    //URL of the document on the official site
        uint64_t doc_id;    //document ID
    };
    struct InvertedElem //one element of the inverted index
    {
        uint64_t doc_id; //document ID
        std::string word;//keyword
        int weight;      //weight used for ranking
        InvertedElem():weight(0){}
    };
    typedef std::vector<InvertedElem> InvertedList_t; //posting list

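    class Index //the index
    {
    private:
        //The forward index is a vector, because the element subscript can serve directly as the document ID
        std::vector<DocInfo> forward_index;
        //The inverted index maps a keyword to a list of InvertedElem (its posting list)
        std::unordered_map<std::string,InvertedList_t> inverted_index;


    private:
        Index(){}                             //private constructor
        Index(const Index&)=delete;           //no copying
        Index& operator=(const Index&)=delete;//no assignment
        static Index* instance;               //static pointer, defined and initialized outside the class
        static std::mutex mtx;                //mutex that protects creation of the singleton
    public:
        ~Index(){}
    public:
        //Get the Index object, creating it on first use
        static Index* GetInstance()
        {
            if(instance==nullptr) //first check without the lock, to skip locking once the singleton exists
            {
                std::lock_guard<std::mutex> guard(mtx); //locks here, unlocks automatically at the end of the block
                if(instance == nullptr) //second check under the lock: only one thread creates the object
                    instance=new Index();
            }
            return instance;
        }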


    public:
        //Look up a document by its ID (forward index)
        DocInfo *GetForwardIndex(const uint64_t &doc_id)
        {
            //If doc_id is in range, simply return the element forward_index[doc_id]
            if(doc_id>=forward_index.size())
            {
                std::cerr<<"doc_id out of range!"<<std::endl;
                return nullptr;
            }
            return &forward_index[doc_id];
        }
        //Get the posting list for a keyword (inverted index)
        InvertedList_t *GetInvertedList(const std::string &word)
        {
            //If word is in the index, return the posting list stored for it
            auto it=inverted_index.find(word);
            if(it==inverted_index.end())
            {
                std::cerr<<word<<" has no InvertedList"<<std::endl;
                return nullptr;
            }
            return &(it->second);
        }
        //Build the forward and inverted indexes from the cleaned document data
        bool BuildIndex(const std::string &input)
        {
            //1. Open the file
            //  raw.txt was written in binary mode, so it is read in binary mode too
            std::ifstream in(input,std::ios::in | std::ios::binary);
            if(!in.is_open())
            {
                std::cerr<<input<<" open error!"<<std::endl;
                return false;
            }
            //2. Read line by line (one document per line)
            std::string line;
            int count=0; //for debug
            while(std::getline(in,line))
            {
                DocInfo* doc=BuildForwardIndex(line); //build the forward index first
                if(doc==nullptr)
                {
                    std::cerr<<"build: "<<line<<" error"<<std::endl;//for debug
                    continue;
                }
                BuildInvertedIndex(*doc);            //then build the inverted index

                //for debug
                count++;
                if(count%1000==0)
                {
                    //std::cout<<"Documents indexed so far: "<< count << std::endl;
                    LOG(NORMAL,"Documents indexed so far: "+std::to_string(count));
                }
            }
            //3. Close the file
            in.close();
            return true;
        }
    private:
        //Build the forward index
        DocInfo *BuildForwardIndex(const std::string &line)
        {
            //1. Parse line: split the string and extract title, content, url
            std::vector<std::string> results;
            const std::string sep="\3";
            ns_util::StringUtil::Split(line,&results,sep);
            if(results.size()!=3)
                return nullptr;
            //2. Fill a DocInfo object with the split fields
            DocInfo doc;
            doc.title=results[0];
            doc.content=results[1];
            doc.url=results[2];
            doc.doc_id=forward_index.size(); //set doc_id before inserting: it then equals the element's subscript after insertion, i.e. forward_index.size()-1
            //3. Append the filled DocInfo to the forward-index vector
            forward_index.push_back(std::move(doc));

            return &forward_index.back();//address of the element just inserted; it stays valid only until the next push_back, which may reallocate the vector
        }
        //Build the inverted index
        bool BuildInvertedIndex(const DocInfo &doc)
        {
            struct word_cnt //word frequency
            {
                int title_cnt;
                int content_cnt;
                word_cnt():title_cnt(0),content_cnt(0){}
            };
            std::unordered_map<std::string,word_cnt> word_map;//<keyword, frequency>
            //Tokenize the title and count word frequency
            std::vector<std::string> title_words;
            ns_util::JiebaUtil::CutString(doc.title,&title_words);
            for(auto &s:title_words)
            {
                boost::to_lower(s); //normalize to lower case, since searches are usually case-insensitive: hello, Hello and HELLO should all return the same results
                word_map[s].title_cnt++;//operator[] updates the value if s already exists; otherwise it first inserts s with a default value
            }
            //Tokenize the content and count word frequency
            std::vector<std::string> content_words;
            ns_util::JiebaUtil::CutString(doc.content,&content_words);
            for(auto &s:content_words)
            {
                boost::to_lower(s); //normalize to lower case
                word_map[s].content_cnt++;//same operator[] behavior as above
            }
            //Custom relevance + fill in the inverted elements
            const int X=10; //weight of one occurrence in the title
            const int Y=1;  //weight of one occurrence in the content
            for(auto& word_pair:word_map)
            {
                //Fill in the fields
                InvertedElem item;
                item.doc_id=doc.doc_id;
                item.word=word_pair.first;
                item.weight=X*word_pair.second.title_cnt+Y*word_pair.second.content_cnt;
                //Append to the posting list of this keyword
                InvertedList_t &inverted_list=inverted_index[word_pair.first];
                inverted_list.push_back(std::move(item));
            }
            return true;
        }
    };
    Index* Index::instance=nullptr;//static members are defined outside the class
    std::mutex Index::mtx;
}
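The double-checked locking in GetInstance reads the plain pointer `instance` outside the lock, which is formally a data race under the C++11 memory model. A common alternative, shown below as a minimal sketch rather than as the project's actual code, is a function-local static object, whose initialization C++11 guarantees to be thread-safe. The class name `MiniIndex` and its members are hypothetical stand-ins for `ns_index::Index`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for ns_index::Index, reduced to a forward index of strings.
class MiniIndex
{
public:
    // The local static is constructed exactly once, on first call,
    // and C++11 makes that construction safe across threads.
    static MiniIndex &GetInstance()
    {
        static MiniIndex instance;
        return instance;
    }
    MiniIndex(const MiniIndex &) = delete;            // no copying
    MiniIndex &operator=(const MiniIndex &) = delete; // no assignment

    // Appends a document and returns its ID (the vector subscript).
    std::size_t Add(const std::string &doc)
    {
        docs_.push_back(doc);
        return docs_.size() - 1;
    }
    // Returns the document stored under id.
    const std::string &Get(std::size_t id) const
    {
        assert(id < docs_.size()); // caller must pass a valid ID
        return docs_[id];
    }

private:
    MiniIndex() = default; // private constructor: only GetInstance creates the object
    std::vector<std::string> docs_;
};
```

With this form no mutex and no static pointer definition are needed. Note that only creation is protected: concurrent calls to `Add` would still need their own synchronization.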

Searcher.hpp

#pragma once

#include"index.hpp"
#include"util.hpp"
#include<algorithm>
#include<cctype>//std::tolower
#include<jsoncpp/json/json.h>
#include<unordered_map>
#include"log.hpp"
namespace ns_searcher
{
    struct InvertedElemPrint{//one merged search hit per document (de-duplicated by doc_id)
        uint64_t doc_id;
        int weight;
        std::vector<std::string> words;
        InvertedElemPrint():doc_id(0),weight(0){}
    };

    class Searcher
    {
    private:
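        ns_index::Index *index;//the index used for lookups
    public:
        Searcher():index(nullptr){}
        ~Searcher(){}
    public:
        //Initialize the Searcher module
        void InitSearcher(const std::string &input)
        {
            //1. Get (or create) the Index singleton
            index = ns_index::Index::GetInstance();
            //std::cout<<"Got the index singleton..."<<std::endl; //for debug
            LOG(NORMAL,"Got the index singleton...");
            //2. Build the index from the input file
            if(!index->BuildIndex(input))
            {
                LOG(FATAL,"Failed to build the index from "+input);
                return;
            }
            //std::cout<<"Index built successfully..."<<std::endl; //for debug
            LOG(NORMAL,"Index built successfully...");
        }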

        //Search for keywords
        //  query: the user's search keywords
        //  json_string: the search result returned to the user's browser
        void Search(const std::string &query,std::string *json_string)
        {
            //1. [Tokenize]: split query the same way the index was tokenized
            std::vector<std::string> words;
            ns_util::JiebaUtil::CutString(query,&words);
            //2. [Trigger]: look up each token in the index
            // ns_index::InvertedList_t inverted_list_all; //holds all inverted-index results
            std::vector<InvertedElemPrint> inverted_list_all;
            std::unordered_map<uint64_t,InvertedElemPrint> tokens_map; //de-duplicate by doc_id

            for(std::string word:words)
            {
                boost::to_lower(word);//lower-case the token, as was done when building the index
                //Look up the inverted index first
                ns_index::InvertedList_t *inverted_list=index->GetInvertedList(word);
                if(inverted_list==nullptr)
                    continue; //no posting list for this token; move on to the next one
                //inverted_list_all.insert(inverted_list_all.end(),inverted_list->begin(),inverted_list->end()); //found: bulk-append the results to inverted_list_all

                //Merge postings that share a document ID
                for(const auto &elem:*inverted_list)
                {
                    auto &item=tokens_map[elem.doc_id];//fetch if present; otherwise create
                    //item is now the merged node for this doc_id
                    item.doc_id=elem.doc_id;
                    item.weight+=elem.weight;
                    item.words.push_back(elem.word);
                }
            }
            for(auto &item:tokens_map) //non-const reference: std::move on a const object would silently copy
            {
                inverted_list_all.push_back(std::move(item.second));
            }

            //3. [Merge and sort]: collect the results and sort by relevance (weight) in descending order
            // std::sort(inverted_list_all.begin(),inverted_list_all.end(),
            //     [](const ns_index::InvertedElem &e1,const ns_index::InvertedElem &e2){
            //         return e1.weight>e2.weight;
            //     });
            std::sort(inverted_list_all.begin(),inverted_list_all.end(),
                [](const InvertedElemPrint &e1,const InvertedElemPrint &e2){
                    return e1.weight>e2.weight;
                });


            //4. [Build]: use jsoncpp to build a JSON string from the search results
            Json::Value root;
            //Look up the forward index
            for(auto &item : inverted_list_all)
            {
                ns_index::DocInfo *doc=index->GetForwardIndex(item.doc_id);
                if(doc==nullptr)
                    continue; //forward lookup failed for this document; move on to the next one

                Json::Value elem;
                elem["title"]=doc->title;
                elem["desc"]=GetDesc(doc->content,item.words[0]);
                elem["url"]=doc->url;

                //for debug
                //elem["id"]=(int)item.doc_id;
                //elem["weight"]=item.weight;

                root.append(elem);
            }
            //Json::StyledWriter writer;
            Json::FastWriter writer;
            *json_string=writer.write(root);
        }

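        //Build the page summary
        std::string GetDesc(const std::string &html_content,const std::string &word)
        {
            //Find the first occurrence of word in html_content,
            //then take up to 50 bytes before it (or start from the beginning) and up to 100 bytes after it (or stop at the end),
            //and return that slice as the summary

            //1. Find the first occurrence (case-insensitive)
            auto iter=std::search(html_content.begin(),html_content.end(),word.begin(),word.end(),
                [](unsigned char x,unsigned char y){ //unsigned char: passing a negative char value to std::tolower is undefined behavior
                    return (std::tolower(x)==std::tolower(y));
            });
            if(iter==html_content.end())//the word is not in the content, e.g. it occurs only in the title
                return "None";
            int pos=std::distance(html_content.begin(),iter);

            //2. Compute start and end
            const int prev_step =50;
            const int next_step=100;
            int start=0;                  //default: beginning of the document
            int end=html_content.size();  //default: one past the last byte, so that the final byte is not dropped
            if(pos > start+prev_step)
                start=pos-prev_step; //move start forward
            if(pos < end-next_step)
                end=pos+next_step;   //move end backward

            //3. Cut the substring [start,end)
            if(start >= end) //should not normally happen; it would indicate empty input or a bug
                return "None";
            std::string desc=html_content.substr(start,end-start);
            desc+="...";
            return desc;
        }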
    };
}
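The merge-and-rank step inside Search can be tried in isolation. The sketch below is a simplified, self-contained version: `Posting`, `MergedHit`, and `MergeAndRank` are hypothetical names that mirror `InvertedElem`, `InvertedElemPrint`, and steps 2 and 3 of Search. The tie-break on `doc_id` is an addition that the original code does not have; it only makes the order deterministic when two documents have equal weight.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical mirror of ns_index::InvertedElem.
struct Posting { std::uint64_t doc_id; std::string word; int weight; };

// Hypothetical mirror of InvertedElemPrint: one merged hit per document.
struct MergedHit
{
    std::uint64_t doc_id = 0;
    int weight = 0;
    std::vector<std::string> words;
};

// Merges postings that share a doc_id (weights are summed), then sorts by weight, descending.
inline std::vector<MergedHit> MergeAndRank(const std::vector<Posting> &postings)
{
    std::unordered_map<std::uint64_t, MergedHit> by_doc;
    for (const Posting &p : postings)
    {
        assert(p.weight >= 0);             // weights are counts times positive factors
        MergedHit &hit = by_doc[p.doc_id]; // created on first use
        hit.doc_id = p.doc_id;
        hit.weight += p.weight;
        hit.words.push_back(p.word);
    }
    std::vector<MergedHit> ranked;
    ranked.reserve(by_doc.size());
    for (auto &kv : by_doc) // non-const reference so the move is a real move
        ranked.push_back(std::move(kv.second));
    std::sort(ranked.begin(), ranked.end(), [](const MergedHit &a, const MergedHit &b) {
        if (a.weight != b.weight) return a.weight > b.weight;
        return a.doc_id < b.doc_id; // assumption: tie-break for a stable order
    });
    return ranked;
}
```

If the query tokens "boost" and "search" both hit document 2, that document appears once in the result, with the two weights added together and both words recorded.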

http_server.cc

#include"Searcher.hpp"
#include"cpp-httplib/httplib.h"
const std::string input="data/raw_html/raw.txt";//data source for the index
const std::string root_path="./wwwroot";//path of the web root directory

int main()
{
    //Initialize the index
    ns_searcher::Searcher search;
    search.InitSearcher(input);



    //Create a Server object, i.e. set up the server side
    httplib::Server svr;

    //Serve the home page from the web root
    svr.set_base_dir(root_path.c_str());

    //Build the HTTP response for GET requests to /s
    svr.Get("/s",[&search](const httplib::Request &req,httplib::Response &rsp){
        if(!req.has_param("word"))//check whether the request carries a search keyword
        {
            rsp.set_content("A search keyword is required!","text/plain;charset=utf-8");
            return;
        }
        //Get the keyword entered by the user
        std::string word=req.get_param_value("word");
        //std::cout<<"User is searching for: "<< word <<std::endl; //for debug
        LOG(NORMAL,"User is searching for: "+word);
        //Build the JSON string for this keyword
        std::string json_string;
        search.Search(word,&json_string);
        //The body of the response to GET /s is the JSON string built from the keyword
        rsp.set_content(json_string,"application/json");
    });

    LOG(NORMAL,"Starting server on port 8888...");

    //Bind port 8888 and start listening (0.0.0.0 means listening on all local network interfaces, not on any port)
    svr.listen("0.0.0.0",8888);
    return 0;
}

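Result after compiling and running:

4) Deploying the project on Linux

First create a Log directory under Boost_Searcher, then create a log.txt file inside the Log directory.

Next, run the command "nohup ./http_server > Log/log.txt 2>&1 &" in the Boost_Searcher directory. This keeps the server program http_server running in the background, much like a daemon: nohup makes it ignore the hangup signal, so the server stays reachable from a browser even after the xshell session is closed, and everything the program prints is stored in Log/log.txt.

The command "ps axj | grep http_server" shows the server program's PID and process state.

To stop the server program, run "kill -9" followed by the server program's PID.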

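9. Project summary

Function: a site search engine for the Boost documentation. Given a query, it lists the pages related to the query in descending order of weight. Each result includes the title, a content summary, and the page URL, and clicking the title jumps straight to the corresponding Boost documentation page.
Principle:

Framework:

Tech stack:

Project environment: Linux CentOS 7 (or Ubuntu 24.04) cloud server, vim / gcc / g++ / Makefile, VS 2022 / VS Code
Gitee source repository:

https://gitee.com/the-driest-one-in-varoran/boost-internal-search-engine.git

Possible extensions:

(1) Build a whole-site search.

What we index is the HTML documents in the doc directory of the Boost library. You could index the entire library instead, or even every released version, but the cost is high. A whole-site search over a single version is feasible, depending on your server's configuration.

(2) Design an online update scheme (signals, a crawler) to complete the server design.

(3) Replace the third-party components with your own implementations.

(4) Add paid ranking to the search engine.

Some search engines are commercial; on Baidu, for example, the top results for many queries are usually ads. This can be implemented by raising the weight of the sponsored documents.

(5) Hot-word statistics

Suggest search keywords intelligently; this can be implemented with a trie and a priority queue.

(6) Multi-threaded tokenization

Tokenizing all the content is a large amount of work. Splitting it across multiple threads and merging the results at the end saves time.

(7) Add login and registration, bringing in MySQL.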
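As an illustration of the hot-word statistics extension (item 5), the following minimal sketch counts how often each query was searched and keeps the k most frequent ones with a priority queue. It covers only the priority-queue half of the idea; prefix suggestions with a trie are not shown. `Entry`, `BetterThan`, and `TopKQueries` are hypothetical names, and breaking ties alphabetically is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Entry = std::pair<int, std::string>; // (search count, query)

// "a is better than b": higher count first, alphabetical order on ties.
struct BetterThan
{
    bool operator()(const Entry &a, const Entry &b) const
    {
        if (a.first != b.first) return a.first > b.first;
        return a.second < b.second;
    }
};

// Returns the k most frequently searched queries, best first.
inline std::vector<std::string> TopKQueries(const std::vector<std::string> &queries, std::size_t k)
{
    std::unordered_map<std::string, int> counts;
    for (const std::string &q : queries) ++counts[q];

    // With BetterThan as the comparator, top() is the worst entry currently kept,
    // so popping whenever the heap exceeds k leaves the k best entries.
    std::priority_queue<Entry, std::vector<Entry>, BetterThan> heap;
    for (const auto &kv : counts)
    {
        heap.push(Entry(kv.second, kv.first));
        if (heap.size() > k) heap.pop();
    }
    assert(heap.size() <= k);

    std::vector<std::string> result;
    while (!heap.empty())
    {
        result.push_back(heap.top().second); // comes out worst first
        heap.pop();
    }
    std::reverse(result.begin(), result.end()); // best first
    return result;
}
```

Because the heap never holds more than k entries, the memory used for ranking stays small even when the number of distinct queries is large.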

NeeEk0
