Project Diary (2): Boost Search Engine

HuaJiahhh

Last modified: 2024-06-02 17:02:45

Category: Project Diary. Tags: search engine, C++, networking

First published: 2024-05-24 15:34:15

Article link: https://blog.csdn.net/huajiahhhh/article/details/139128540

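1. Getting the index started

2. Getting the forward and inverted indexes

3. Building the index

4. Building the forward and inverted indexes

5. Index singleton implementation

6. Word segmentation with cppjieba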

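1. Getting the index started

1. DocInfo stores a document's basic information, and InvertedElem stores the ranking information for one keyword in one document. forward_index is the forward index and inverted_index is the inverted index. The inverted index uses the zipper (chaining) approach to link each keyword to its documents, which are kept in a vector because a keyword usually appears in more than one document.

Why do DocInfo and InvertedElem both carry a doc_id?

The short version: a hit in the inverted index only says which document matched, and the doc_id stored in InvertedElem is what GetForwardIndex (section 2) uses to fetch the full DocInfo.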

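#pragma once

#include <cstdint>
#include <fstream>
#include <iostream>
#include <mutex>
#include <string>
#include <vector>
#include <unordered_map>
#include "util.hpp" // ns_util::StringUtil and ns_util::JiebaUtil, used in section 4
using namespace std;


namespace ns_index
{
    // Basic information about one document.
    struct DocInfo
    {
        string title;
        string content;
        string url;
        uint64_t doc_id; // document id
    };

    // One entry of an inverted list: ranking information for a (keyword, document) pair.
    struct InvertedElem
    {
        uint64_t doc_id; // document id
        string word;     // keyword
        int weight;      // weight of the keyword in this document
        InvertedElem()
            :doc_id(0)
            ,weight(0)
        {}
    };

    // Inverted list (the "zipper"): all entries for one keyword.
    typedef vector<InvertedElem> InvertedList;

    class Index
    {
    private:
        // Forward index: the vector subscript is the doc_id.
        vector<DocInfo> forward_index;

        // Inverted index: one keyword maps to many documents.
        unordered_map<string, InvertedList> inverted_index;

    public:

    };

}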
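To see how the two containers fit together, here is a minimal sketch with made-up data. ToyDoc, ToyElem, ToyIndex and MakeToyIndex are hypothetical names that do not appear in the project, the titles and weights are invented, and the structs are trimmed down to the fields needed to show the idea.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Trimmed-down stand-ins for the article's DocInfo and InvertedElem (hypothetical names).
struct ToyDoc {
    std::string title;
    std::uint64_t doc_id;
};

struct ToyElem {
    std::uint64_t doc_id;
    int weight;
};

struct ToyIndex {
    std::vector<ToyDoc> forward;                                     // doc_id -> document
    std::unordered_map<std::string, std::vector<ToyElem>> inverted;  // word -> inverted list

    // Append a document; its doc_id is its position in `forward`.
    std::uint64_t AddDoc(const std::string& title) {
        std::uint64_t id = forward.size();
        forward.push_back({title, id});
        return id;
    }

    // Record that `word` occurs in document `id` with the given weight.
    void AddPosting(const std::string& word, std::uint64_t id, int weight) {
        assert(id < forward.size());  // the document must already be in the forward index
        inverted[word].push_back({id, weight});
    }
};

// Sample data invented for illustration only.
inline ToyIndex MakeToyIndex() {
    ToyIndex idx;
    std::uint64_t a = idx.AddDoc("Boost.Filesystem tutorial");
    std::uint64_t b = idx.AddDoc("Boost.Asio overview");
    idx.AddPosting("boost", a, 10);
    idx.AddPosting("boost", b, 10);
    idx.AddPosting("asio", b, 11);
    return idx;
}
```

The inverted list for "boost" holds two entries, one per document, and each entry's doc_id can be used directly as a subscript into forward. That shared id is the only thing tying the two indexes together.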

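2. Getting the forward and inverted indexes

1. GetForwardIndex: looks up the forward index, which associates a document id with the document's content. Because doc_id and the vector subscript are the same thing, passing a doc_id returns the matching entry of forward_index with the document's details.

2. GetInvertedIndex: looks up the inverted index, where one keyword maps to many documents. Because inverted_index is an unordered_map, the keyword leads straight to the list of documents that contain it.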


        // Look up a document by doc_id (forward index).
        // This works because doc_id equals the subscript in forward_index.
        DocInfo* GetForwardIndex(uint64_t doc_id)
        {
            if(doc_id >= forward_index.size())
            {
                cerr << "doc_id out of range, error!" << endl;
                return nullptr;
            }
            return &forward_index[doc_id];
        }

        // Look up the inverted list for a keyword. One keyword maps to many
        // documents, so the result is an InvertedList (a vector).
        InvertedList* GetInvertedIndex(const string& word)
        {
            auto iter = inverted_index.find(word);
            if(iter == inverted_index.end())
            {
                cerr << word << " has no InvertedList" << endl;
                return nullptr;
            }
            return &(iter->second);
        }
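The two getters are meant to be used together at query time. The sketch below shows one plausible flow under simplifying assumptions: Posting, Forward, Inverted and Search are hypothetical names, the forward index holds only titles, and sorting by descending weight is an assumed ranking rule, since the search step itself is not part of this post.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct Posting {  // hypothetical stand-in for InvertedElem
    std::uint64_t doc_id;
    int weight;
};

using Forward  = std::vector<std::string>;  // doc_id -> title only, for brevity
using Inverted = std::unordered_map<std::string, std::vector<Posting>>;

// Return the titles of the documents containing `word`, highest weight first.
std::vector<std::string> Search(const Forward& forward,
                                const Inverted& inverted,
                                const std::string& word) {
    std::vector<std::string> titles;
    auto it = inverted.find(word);
    if (it == inverted.end()) return titles;  // no inverted list for this word

    std::vector<Posting> hits = it->second;   // copy, so the index itself is not reordered
    std::stable_sort(hits.begin(), hits.end(),
                     [](const Posting& a, const Posting& b) { return a.weight > b.weight; });

    for (const Posting& p : hits) {
        if (p.doc_id < forward.size())        // same bounds check as GetForwardIndex
            titles.push_back(forward[p.doc_id]);
    }
    return titles;
}
```

The inverted list answers "which documents, and how relevant", and the forward index turns each doc_id back into something that can be shown to the user.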

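3. Building the index

Building the index covers both the forward and the inverted index. The parameter input is the path of the file that holds the parsed documents. We open it with an input file stream for reading, pull one line at a time from in into line (one line per document), then build the forward index entry for that line and, from it, the inverted index entries.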

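        // Build both indexes from the file at path `input`.
        bool BuildIndex(const string& input)
        {
            // Open the file for reading in binary mode (needs <fstream>).
            ifstream in(input, ios::in | ios::binary);
            // Bail out if the file could not be opened.
            if(!in.is_open())
            {
                cerr << "sorry, " << input << " open error" << endl;
                return false;
            }

            // Each line read into `line` holds one document.
            string line;
            int count = 0;
            while(getline(in, line))
            {
                // Build the forward index entry; returns the stored DocInfo.
                DocInfo* doc = BuildForwardIndex(line);
                if(nullptr == doc)
                {
                    cerr << "build " << line << " error" << endl;
                    continue;
                }
                cout << "forward index entry built" << endl;

                // Build the inverted index entries for this document.
                cout << "building inverted index entries" << endl;
                BuildInvertedIndex(*doc);
                cout << "inverted index entries built" << endl;

                count++;
                if(count % 50 == 0)
                {
                    cout << "documents indexed so far: " << count << endl;
                }
            }
            return true;
        }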
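The loop above assumes each line has the shape title\3content\3url. As a rough illustration of that record format, here is a standard-library-only sketch. SplitRecord and CountValidRecords are hypothetical helpers, and the splitting uses std::getline rather than the boost::split call the project wraps in StringUtil, so separator handling differs in edge cases.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Split `line` on the single byte '\3'. Interior empty fields are kept;
// this is plain std::getline, not the project's boost::split wrapper.
std::vector<std::string> SplitRecord(const std::string& line) {
    std::vector<std::string> fields;
    std::string field;
    std::istringstream ss(line);
    while (std::getline(ss, field, '\3')) fields.push_back(field);
    return fields;
}

// Read records line by line and count those with exactly title, content, url.
std::size_t CountValidRecords(std::istream& in) {
    std::size_t count = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (SplitRecord(line).size() != 3) continue;  // skip malformed lines, as BuildIndex does
        ++count;
    }
    return count;
}
```

Taking a std::istream& instead of a file path makes the loop easy to try with a std::istringstream, while an ifstream opened on the real data file can be passed in the same way.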

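4. Building the forward and inverted indexes

1. BuildForwardIndex: builds the forward index

(1) Parse the document line by splitting it, using the StringUtil::Split interface wrapped in util.hpp.

(2) Put the resulting strings into a DocInfo.

(3) Append it to the forward index vector. move is used to avoid copies and improve efficiency.

(4) Finally, return a pointer to the element that was just appended to the DocInfo vector.

2. BuildInvertedIndex: builds the inverted index

Custom relevance: word_cnt records how many times a word appears in the document's title and in its content.

Because one keyword may map to several document ids, inverted_index is an unordered_map from keyword to inverted list. A second, local unordered_map (word_map) holds the per-word counts for the current document.

The title and the content are then segmented and counted separately.

to_lower: searching by keyword is case-insensitive, so every word is converted to lower case before it is counted.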

        DocInfo* BuildForwardIndex(const string& line)
        {
            // 1. Parse the line into title, content, url by splitting on the separator.
            vector<string> results;
            const string sep = "\3";
            ns_util::StringUtil::Split(line, &results, sep);
            // A valid line splits into exactly title, content, url.
            if(results.size() != 3)
            {
                return nullptr;
            }

            // 2. Fill a DocInfo with the fields.
            DocInfo doc;
            doc.title = results[0];
            doc.content = results[1];
            doc.url = results[2];
            doc.doc_id = forward_index.size(); // id == subscript of the slot it is about to occupy

            // 3. Append to the forward index vector; move avoids copying the strings.
            forward_index.push_back(move(doc));
            // The pointer is only valid until a later push_back reallocates the vector.
            return &forward_index.back();
        }

        bool BuildInvertedIndex(const DocInfo& doc)
        {
            // Per-word counts used to compute relevance.
            struct word_cnt
            {
                int title_cnt;
                int content_cnt;

                word_cnt()
                    :title_cnt(0)
                    ,content_cnt(0)
                {}
            };

            // word -> counts for this document.
            unordered_map<string, word_cnt> word_map;

            // Segment the title.
            vector<string> title_words;
            ns_util::JiebaUtil::CutString(doc.title, &title_words);
            cout << "title segmented" << endl;

            // Count title words; s is a copy so it can be lower-cased.
            for(string s : title_words)
            {
                boost::to_lower(s);
                word_map[s].title_cnt++;
            }

            // Segment the content.
            vector<string> content_words;
            ns_util::JiebaUtil::CutString(doc.content, &content_words);
            cout << "content segmented" << endl;

            // Count content words.
            for(string s : content_words)
            {
                boost::to_lower(s);
                word_map[s].content_cnt++;
            }

            // Relevance weights: a hit in the title counts 10 times a hit in the content.
            const int X = 10;
            const int Y = 1;

            for(auto& word_pair : word_map)
            {
                InvertedElem item;
                item.doc_id = doc.doc_id;
                item.word = word_pair.first;
                item.weight = X*word_pair.second.title_cnt + Y*word_pair.second.content_cnt;
                InvertedList& inverted_list = inverted_index[word_pair.first];
                inverted_list.push_back(move(item));
            }
            return true;
        }
    };
    Index* Index::instance = nullptr;
    mutex Index::mtx;
}
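The weight formula is easier to follow with numbers. The sketch below reproduces the 10-to-1 title/content weighting on a single document. It swaps in two simplifications that are not the project's method: words are split on whitespace instead of by jieba, and lower-casing is ASCII-only instead of boost::to_lower. ToLowerAscii and ComputeWeights are hypothetical names.

```cpp
#include <cassert>
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_map>

// Lower-case ASCII letters in place (stand-in for boost::to_lower).
void ToLowerAscii(std::string& s) {
    for (char& c : s) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// weight = 10 * (occurrences in title) + 1 * (occurrences in content),
// the same X = 10, Y = 1 weighting used in BuildInvertedIndex.
std::unordered_map<std::string, int> ComputeWeights(const std::string& title,
                                                    const std::string& content) {
    const int kTitleWeight = 10;
    const int kContentWeight = 1;
    std::unordered_map<std::string, int> weights;
    std::string word;

    std::istringstream title_stream(title);  // whitespace split, NOT jieba
    while (title_stream >> word) {
        ToLowerAscii(word);
        weights[word] += kTitleWeight;
    }

    std::istringstream content_stream(content);
    while (content_stream >> word) {
        ToLowerAscii(word);
        weights[word] += kContentWeight;
    }
    return weights;
}
```

For a title "Boost Asio" and content "asio uses boost and Boost timers", the word "boost" appears once in the title and twice in the content (after lower-casing), so its weight is 10 * 1 + 1 * 2 = 12, while "timers" appears only in the content and gets 1.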

5. Index singleton implementation

Index deletes its copy constructor and copy assignment operator, and the single instance is obtained through GetInstance.

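    private:
        // Private constructor: only GetInstance may create the object.
        Index(){}
        Index(const Index&) = delete;
        Index& operator=(const Index&) = delete;

        static Index* instance;
        static mutex mtx;
    public:
        ~Index(){}
        static Index* GetInstance()
        {
            // Double-checked locking: take the lock only while the instance may still be missing.
            // The first check reads instance without the lock; declaring instance as
            // std::atomic<Index*> would make that read formally race-free.
            if(nullptr == instance)
            {
                lock_guard<mutex> guard(mtx); // unlocks automatically, even if new throws
                if(nullptr == instance)
                {
                    instance = new Index();
                }
            }
            return instance;
        }

    // Definitions of the static members (outside the class, inside namespace ns_index).
    Index* Index::instance = nullptr;
    mutex Index::mtx;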
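For comparison, here is a different way to get a single instance that needs no mutex and no static pointer: a function-local static, often called a Meyers singleton. This is not what the project does; it is only a sketch of an alternative, and Registry and counter are hypothetical names.

```cpp
#include <cassert>

class Registry {  // hypothetical class, standing in for Index
public:
    // The local static is constructed on first use; since C++11 that
    // initialization is guaranteed to happen exactly once, even with threads.
    static Registry& GetInstance() {
        static Registry instance;
        return instance;
    }

    Registry(const Registry&) = delete;
    Registry& operator=(const Registry&) = delete;

    int counter = 0;  // some state, just to show that every caller sees the same object

private:
    Registry() = default;  // only GetInstance can construct
};
```

The trade-off is that GetInstance returns a reference rather than a pointer, so call sites written against Index* would need adjusting.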

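6. Word segmentation with cppjieba

1. Clone cppjieba:

Open the getcode website, search for cppjieba, and clone it into your local shell. Then create the symbolic links it needs, and so on.

2. Using jieba:

(1) StringUtil splits a document line into its fields.

(2) JiebaUtil wraps jieba: it constructs the jieba object, exposes it as a singleton, initializes it (loading the stop words), and performs segmentation through CutString.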

    // Splits a parsed document line into title, content, url.
    class StringUtil
    {
    public:
        // target is the line to split; the pieces separated by sep are stored in out.
        // token_compress_on merges adjacent separators into one.
        static void Split(const string &target, vector<string> *out, const string &sep)
        {
            boost::split(*out, target, boost::is_any_of(sep), boost::token_compress_on);
        }
    };

    // Dictionary files required by jieba.
    const char *const DICT_PATH = "./dict/jieba.dict.utf8";
    const char *const HMM_PATH = "./dict/hmm_model.utf8";
    const char *const USER_DICT_PATH = "./dict/user.dict.utf8";
    const char *const IDF_PATH = "./dict/idf.utf8";
    const char *const STOP_WORD_PATH = "./dict/stop_words.utf8";

    // Wrapper around jieba word segmentation.
    class JiebaUtil
    {
    private:
        cppjieba::Jieba jieba;
        std::unordered_map<std::string, bool> stop_words;

        JiebaUtil():jieba(DICT_PATH, HMM_PATH, USER_DICT_PATH, IDF_PATH, STOP_WORD_PATH)
        {
            cout << "JiebaUtil created" << endl;
        }
        JiebaUtil(const JiebaUtil&) = delete;
        JiebaUtil& operator=(const JiebaUtil&) = delete;
        ~JiebaUtil()
        {}

        static JiebaUtil *instance;

    public:
        // jieba singleton.
        static JiebaUtil* get_instance()
        {
            static mutex mtx;
            if (nullptr == instance)
            {
                lock_guard<mutex> guard(mtx); // unlocks automatically, even on exceptions
                if (nullptr == instance)
                {
                    // Finish initialization before publishing the pointer, so no
                    // other caller can see an instance without its stop words.
                    JiebaUtil *tmp = new JiebaUtil();
                    tmp->InitJiebaUtil();
                    instance = tmp;
                }
            }
            return instance;
        }

        // Initialization: load the stop words, one per line.
        void InitJiebaUtil()
        {
            ifstream in(STOP_WORD_PATH);
            if (!in.is_open())
            {
                cerr << "load stop words file error" << endl;
                return;
            }

            string line;
            while (getline(in, line))
            {
                stop_words.insert({line, true});
            }
            in.close();
        }

        // Segment src into out, then drop every stop word from the result.
        void CutStringHelper(const string &src, vector<string> *out)
        {
            jieba.CutForSearch(src, *out);
            for (auto iter = out->begin(); iter != out->end(); )
            {
                auto it = stop_words.find(*iter);
                if (it != stop_words.end())
                {
                    // erase returns the iterator to the next element.
                    iter = out->erase(iter);
                }
                else
                {
                    iter++;
                }
            }
        }

        // Static entry point that forwards to CutStringHelper on the singleton.
        static void CutString(const string &src, vector<string> *out)
        {
            cout << "segmentation started" << endl;
            ns_util::JiebaUtil::get_instance()->CutStringHelper(src, out);
            cout << "segmentation finished" << endl;
        }
    };

    JiebaUtil *JiebaUtil::instance = nullptr;
} // namespace ns_util
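The stop-word filtering step can be tried on its own, without cppjieba installed. The sketch below is a standard-library variant, not the project's code: RemoveStopWords is a hypothetical helper, the stop words live in an unordered_set instead of the unordered_map<string, bool> used above, and the removal uses the erase-remove idiom instead of erasing element by element.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Remove every token that appears in `stop_words`, keeping the rest in order.
void RemoveStopWords(std::vector<std::string>& tokens,
                     const std::unordered_set<std::string>& stop_words) {
    tokens.erase(std::remove_if(tokens.begin(), tokens.end(),
                                [&stop_words](const std::string& t) {
                                    return stop_words.count(t) != 0;
                                }),
                 tokens.end());
}
```

Erasing one element at a time from the middle of a vector shifts everything after it on each call, whereas remove_if compacts the survivors in a single pass and erase then trims the tail once. For short token lists the difference hardly matters, but document bodies can produce long ones.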

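Afterword: wrapping everything into classes is somewhat involved and the naming is a little messy, but it all becomes clear once you trace through the structure yourself.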
