Site Search Engine: 01_Removing Tags and Cleaning Data


✨✨Welcome to T_X_Parallel's blog!!
🛰️Blog home page: T_X_Parallel
🛰️Project code repository: Site Search Engine project code repository
🛰️Column: Site Search Engine Project


Project environment: Linux cloud server (CentOS 7.9), VS Code 1.85.2, g++/CMake

Tech stack: C/C++ (C++11), STL, Boost, Jsoncpp, cppjieba, cpp-httplib, HTML

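1. Downloading and preparing the Boost documentation HTML sources

The project is built on the HTML sources of the documentation published on the Boost website, so the first step is to get them.

Download the archive on Windows, then upload it to the cloud server with the command rz -E

Extract the archive with the command tar xzf <archive name>

Finally, move the pages into the project data directory created beforehand with the command mv /<version>/doc/html/ /data/input

Only the files under /doc/html/ are needed; once they have been moved, everything else can be deleted.

The next step is to write code that processes the contents of these HTML files.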

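2. Writing the program that strips tags and cleans the data

Step 1: Collecting file paths and file names

This step relies on the Boost library, so Boost has to be installed first.

On my machine (CentOS 7.9), that means running sudo yum install -y boost-devel

Boost has a library called filesystem; from it we use the path object, the exists() function, the recursive_directory_iterator iterator, the is_regular_file() function, and so on.

Create the function bool EnumFiles(const std::string &path, std::vector<std::string> &file_lists) to collect the file paths and names: path is the directory that holds the .html files, and file_lists receives the path (including the file name) of every file found.

Inside EnumFiles, first define an alias for boost::filesystem to make the library easier to use (from here on, fs stands for boost::filesystem).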

namespace fs = boost::filesystem;

Initialize an fs::path object from the path argument and check whether that path exists

    fs::path rpath(path);

    if (!fs::exists(rpath))
    {
        std::cerr << path << " Path not exists!" << std::endl;
        return false;
    }

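Use the iterator iter to walk every entry under the given directory (recursively), and use if checks to pick out the regular .html files

fs::is_regular_file() tells whether the current entry is a regular file (an .html file is a regular file)

iter->path().extension() returns the extension of the current entry, including the leading dot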

    fs::recursive_directory_iterator end_iter;
    for (fs::recursive_directory_iterator iter(rpath); iter != end_iter; ++iter)
    {
        if (!fs::is_regular_file(*iter))
            continue;
        if (iter->path().extension() != ".html")
            continue;
        file_lists.push_back(iter->path().string());
    }

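Once everything that is not a regular .html file has been skipped, only the files we need remain. Each one is pushed into the in/out parameter std::vector<std::string> &file_lists, which ends up holding the path of every .html file (after implementing this, print the list to verify the result).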


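Step 2: Reading the .html files, stripping tags, and cleaning the content

Start by making the goal explicit: we want to remove the tags from each .html file, so we first need to look at what an ordinary .html file contains.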


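Looking at such a file, the body text always starts after a '>' and ends at the next '<', so a single pass over the file content is enough to extract it. Besides the body text, tag stripping and cleaning also need to produce the title and the URL (used later for indexing and for linking to the page), so one data structure can hold these attributes for each file.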

struct File_Info
{
    std::string title;   // title of the file
    std::string content; // content of the file (after tag removal)
    std::string url;     // URL of the corresponding page on the website
};

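To derive the URL, look at the address of one page on the official site (for example https://www.boost.org/doc/libs/1_86_0/doc/html/array.html)

Its tail looks just like a file path on Linux or Windows, and our local copies live under /home/Parallel_9/project/boost/data/input/html,

so a file's URL can be obtained by cutting the local prefix off its path and appending the rest to a fixed URL prefix.

First create the main function for tag stripping and cleaning. It reads the content of each file and then calls helper functions to clean that content.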

bool ParseHtml(const std::vector<std::string> &file_lists, std::vector<File_Info> &file_infos)

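file_lists holds the paths of all .html files collected in the previous step, and file_infos receives the final result.

The function first iterates over the file paths and reads the full content of each file, for example with the stream's read() function (the version below uses std::getline). Because file reading may be needed again later, it is worth creating a header util.hpp for general-purpose helpers, so that they do not have to be reimplemented.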

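// util.hpp

#pragma once

#include <fstream>
#include <iostream>
#include <string>

namespace ns_util
{
    class FileUtil
    {
    public:
        static bool ReadFile(const std::string &file, std::string &content)
        {
            // Open the file
            std::ifstream in(file.c_str());
            if (!in.is_open())
            {
                std::cerr << "Open file " << file << " failed!" << std::endl;
                return false;
            }
            // Read the file line by line with getline
            // (getline drops each line's '\n', so the lines are joined directly)
            std::string line;
            while (std::getline(in, line))
                content += line;
            // Close the file
            in.close();
            return true;
        }
    };
    // ······
} 
// This is basic file reading, so it needs no further explanation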
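One thing to be aware of in the ReadFile above: std::getline discards the '\n' at the end of each line, so the last word of one line and the first word of the next end up glued together. As a possible alternative (a sketch, not the author's implementation), the whole file can be copied through the stream buffer, which keeps the newlines so that a later pass can turn them into spaces. The name ReadWholeFile is hypothetical.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: reads the entire file into `content`, newlines included.
// Returns false if the file cannot be opened.
bool ReadWholeFile(const std::string &file, std::string &content)
{
    std::ifstream in(file.c_str(), std::ios::in | std::ios::binary);
    if (!in.is_open())
        return false;

    std::ostringstream buffer;
    buffer << in.rdbuf(); // copy everything from the file's stream buffer
    content = buffer.str();
    return true;
}
```

Note that with this variant the title could in principle contain a '\n' as well, so it would need the same newline-to-space treatment as the content before being written out.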

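With the content of the current file in hand, we can strip its tags and clean it.

First extract the title. In the sample .html content shown earlier the title is easy to spot: it is the text between <title> and </title>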


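So it is enough to find the positions of <title> and </title> in the content and call substr() to get the title

Note: find() returns the position of the '<' of <title>, so the length of <title> has to be added to get the start of the title text

Also check that the file actually contains a title; if it does not, return false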
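The title lookup described above can be sketched as a small standalone function. ExtractTitle is a hypothetical name used only for illustration; the author's own version appears in the full source at the end. One small difference, which is my assumption rather than the article's approach: the closing tag is searched for only after the opening tag.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: copies the text between <title> and </title> into `title`.
// Returns false if either tag is missing.
bool ExtractTitle(const std::string &html, std::string &title)
{
    const std::string open_tag = "<title>";
    const std::string close_tag = "</title>";

    std::size_t begin = html.find(open_tag);
    if (begin == std::string::npos)
        return false;
    begin += open_tag.size(); // skip past "<title>" to the first title character

    // Look for the closing tag only after the opening tag.
    std::size_t end = html.find(close_tag, begin);
    if (end == std::string::npos)
        return false;

    title = html.substr(begin, end - begin);
    return true;
}
```

An empty title (<title></title>) is treated as success here and yields an empty string.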

Next, get the content with the tags removed


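In the excerpt from one of the .html files, the text marked with the red box is the body content and everything else is tags, which are what we want to remove. A tag always runs from '<' to '>', so while scanning the file content we are in one of only two states: inside a tag (LABEL) or inside content (CONTENT). An enum can represent the two states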

enum Status
{
    LABEL,   // inside a tag
    CONTENT, // inside content
};

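Whenever the scan is on a character that belongs to content, append it to the result; after one pass, the result holds the file's content with all tags removed

Note: the scan can run into the newline character '\n', which must be replaced with a space ' ', because '\n' is used later as the separator between the parsed text of different HTML files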
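As a minimal sketch of the two-state scan, the function below returns the stripped text directly. StripTags is a hypothetical name; the scan starts in the LABEL state, as the author's ParseContent in the full source does, which assumes the input begins with a tag.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: returns `html` with everything between '<' and '>' removed
// and every '\n' in the remaining text replaced by a space.
std::string StripTags(const std::string &html)
{
    enum Status { LABEL, CONTENT };
    Status s = LABEL; // assume the document starts with a tag
    std::string text;

    for (char c : html)
    {
        if (s == LABEL)
        {
            if (c == '>') // end of tag: what follows is content
                s = CONTENT;
        }
        else if (c == '<') // start of tag: stop collecting
        {
            s = LABEL;
        }
        else
        {
            text.push_back(c == '\n' ? ' ' : c);
        }
    }
    return text;
}
```

For the input "<p>Hello\n<b>world</b></p>" this yields "Hello world". A scan this simple does not decode entities such as &lt; and treats a '>' inside an attribute value as the end of the tag, which is acceptable for a first pass over generated documentation but is not a general HTML parser.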

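Finally, get the URL for the file. As analyzed above, appending the file name to the fixed address gives the file's URL. Note that, unlike the previous two functions, this one takes the file path as its argument rather than the file content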

const std::string url_head = "https://www.boost.org/doc/libs/1_86_0/doc/html"; // must match the version you downloaded
const std::string url_tail = file_path.substr(src_path.size()); // keeps its leading '/', e.g. "/array.html"
url = url_head + url_tail;
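To make the prefix swap concrete, here is a small self-contained sketch. BuildUrl and its parameter names are hypothetical; unlike the fragment above, it also checks that the path really starts with the source directory, which is an extra safeguard of mine and not part of the article's code.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: maps a local file path under `src_root` to its online URL
// by replacing the `src_root` prefix with `url_head`.
// Returns false if `file_path` does not start with `src_root`.
bool BuildUrl(const std::string &src_root, const std::string &url_head,
              const std::string &file_path, std::string &url)
{
    if (file_path.compare(0, src_root.size(), src_root) != 0)
        return false;
    // The remainder keeps its leading '/', so `url_head` must not end with one.
    url = url_head + file_path.substr(src_root.size());
    return true;
}
```

With src_root "../data/input/html" and the 1_86_0 prefix, the path "../data/input/html/array.html" maps to https://www.boost.org/doc/libs/1_86_0/doc/html/array.html.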

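After each file is cleaned, push its result into the array file_infos; once the loop finishes, it holds the attributes we need for every file (after implementing this, print them to verify the result)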


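Step 3: Saving the parsed content

Write the content parsed in the previous two steps to a designated file in a fixed format. That file holds the tag-stripped, cleaned content of all the .html files and supplies the data for the later stages

The format needs to be fixed up front: within each HTML file's record, title, content, and url are separated by '\3', and the records of different HTML files are separated by '\n'

(Hint: the file can be written with the write() function or with '<<'; the second is more convenient and works much like printing with cout)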
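The record layout can be illustrated with a pair of small functions: one builds a record, the other splits a line back into its fields. Both names are hypothetical, and the split side is only my assumption about how a later stage might consume the file; the article itself only defines the write format.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: builds one record in the layout title \3 content \3 url \n.
std::string MakeRecord(const std::string &title, const std::string &content,
                       const std::string &url)
{
    const char sep = '\3';
    return title + sep + content + sep + url + '\n';
}

// Hypothetical helper: splits one line (without its trailing '\n') on '\3'.
std::vector<std::string> SplitRecord(const std::string &line)
{
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true)
    {
        std::size_t pos = line.find('\3', start);
        if (pos == std::string::npos)
        {
            fields.push_back(line.substr(start)); // last field
            break;
        }
        fields.push_back(line.substr(start, pos - start));
        start = pos + 1;
    }
    return fields;
}
```

Because '\3' is a control character that does not normally occur in document text, it is unlikely to collide with real content, and one record per line means the file can be read back with std::getline.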

Finally, test the result.

This step is only basic file writing; see the complete code below for the implementation.

Source code

Tag stripping and data cleaning: parse.cc

#include <iostream>
#include <string>
#include <vector>
#include <fstream>
#include <boost/filesystem.hpp>
#include "util.hpp"

const std::string src_path = "../data/input/html";
const std::string output_path = "../data/raw_html/raw.txt";

struct File_Info
{
    std::string title;
    std::string content;
    std::string url;
};

bool EnumFiles(const std::string &path, std::vector<std::string> &file_lists)
{
    // Define an alias to make boost's filesystem library easier to use
    namespace fs = boost::filesystem;
    fs::path rpath(path);
    // Check whether the path exists
    if (!fs::exists(rpath))
    {
        std::cerr << path << " Path not exists!" << std::endl;
        return false;
    }

    fs::recursive_directory_iterator end_iter; // default-constructed iterator, used as the end marker
    for (fs::recursive_directory_iterator iter(rpath); iter != end_iter; ++iter)
    {
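        // Skip anything that is not a regular file (.html files are regular files)
        if (!fs::is_regular_file(*iter))
            continue;
        // Skip files whose extension is not .html
        if (iter->path().extension() != ".html")
            continue;
        // Everything that gets here is a regular .html file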
        file_lists.push_back(iter->path().string());
    }
    return true;
}

// Extract the file's title
static bool ParseTitle(const std::string &html, std::string &title)
{
    std::size_t begin = html.find("<title>");
    if (begin == std::string::npos)
    {
        std::cerr << "Can't find <title> in html!" << std::endl;
        return false;
    }
    std::size_t end = html.find("</title>");
    if (end == std::string::npos)
    {
        std::cerr << "Can't find </title> in html!" << std::endl;
        return false;
    }
    begin += std::string("<title>").size();
    if (begin > end)
    {
        std::cerr << "begin > end!" << std::endl;
        return false;
    }
    title = html.substr(begin, end - begin);
    return true;
}

// Extract the file's content
static bool ParseContent(const std::string &file_content, std::string &content)
{
    enum Status
    {
        LABEL,
        CONTENT,
    };

    enum Status s = LABEL;
    // Walk through every character of the file content
    for (char c : file_content)
    {
        switch (s)
        {
        case LABEL:
            if (c == '>') // end of tag
                s = CONTENT;
            break;
        case CONTENT:
            if (c == '<') // start of tag
                s = LABEL;
            else
            {
                // Replace a newline with a space
                if (c == '\n')
                    c = ' ';
                content.push_back(c);
            }
            break;
        default:
            break;
        }
    }
    return true;
}

// Build the file's url
static bool ParseUrl(const std::string &file_path, std::string &url)
{
    const std::string url_head = "https://www.boost.org/doc/libs/1_86_0/doc/html";
    const std::string url_tail = file_path.substr(src_path.size());

    url = url_head + url_tail;
    return true;
}

bool ParseHtml(const std::vector<std::string> &file_lists, std::vector<File_Info> &file_infos)
{
    for (const auto &file : file_lists)
    {
        // Read the file content
        std::string result;
        if (!ns_util::FileUtil::ReadFile(file, result))
        {
            std::cerr << "ReadFile failed! file: " << file << std::endl;
            return false;
        }

        File_Info Doc;
        // Extract the file's title
        if (!ParseTitle(result, Doc.title))
        {
            std::cerr << "ParseTitle failed! file: " << file << std::endl;
            return false;
        }

        // Extract the file's content
        if (!ParseContent(result, Doc.content))
        {
            std::cerr << "ParseContent failed! file: " << file << std::endl;
            return false;
        }

        // Build the file's url
        if (!ParseUrl(file, Doc.url))
        {
            std::cerr << "ParseUrl failed! file: " << file << std::endl;
            return false;
        }
        file_infos.push_back(Doc);
    }
    return true;
}

bool SaveFile(const std::string &output_path, const std::vector<File_Info> &file_infos)
{
    // Open the file
    std::ofstream output(output_path, std::ios::out | std::ios::binary);
    if (!output.is_open())
    {
        std::cerr << "Open output file failed!" << std::endl;
        return false;
    }
    // Loop over every file and write its title, content, and url to the output
    for (const auto &file : file_infos)
    {
        output << file.title << '\3' << file.content << '\3' << file.url << '\n';
        // break; // for testing: write only one file's content
    }
    // Close the file
    output.close();
    return true;
}

int main()
{
    std::vector<std::string> file_lists;

    // Step 1: collect the file list for later processing
    if (!EnumFiles(src_path, file_lists))
    {
        std::cerr << "EnumFiles failed!" << std::endl;
        return -1;
    }
    // for (const auto &file : file_lists)
    // {
    //     std::cout << file << std::endl;
    // }

    // Step 2: strip tags from and clean each .html file
    std::vector<File_Info> file_infos;
    if (!ParseHtml(file_lists, file_infos))
    {
        std::cerr << "ParseHtml failed!" << std::endl;
        return -1;
    }
    // for (const auto &file : file_infos)
    // {
    //     std::cout << "Title: " << file.title;
    //     std::cout << "Content: " << file.content;
    //     std::cout << "Url: " << file.url << std::endl;
    // }
    // std::cout << "Title: " << file_infos[0].title;
    // std::cout << "Content: " << file_infos[0].content;
    // std::cout << "Url: " << file_infos[0].url << std::endl;

    // Step 3: write the parsed data to output; '\3' separates title, content, and url, and '\n' separates files
    if (!SaveFile(output_path, file_infos))
    {
        std::cerr << "SaveFile failed!" << std::endl;
        return -1;
    }
    
    return 0;
}

