Boost Search Engine Project

Latest recommended article published on 2024-08-03 08:06:24

Achermar

Tags: search engine

Article link: https://blog.csdn.net/Achermar/article/details/133607074

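1. Project background

2. How a search engine works at a high level

3. Tech stack and project environment

4. Forward index vs inverted index - how the search engine works in detail

5. Writing the tag-stripping and data-cleaning module: Parser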

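1. Project background

Companies: Baidu, Sogou, 360 Search, and the Toutiao news client all run full-web search. Building something like that ourselves is not realistic!

Site search: the data being searched is more vertical (domain-specific), and the data volume is much smaller.

The Boost website has no site search, so we need to build one ourselves.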

2. How a search engine works at a high level

3. Tech stack and project environment

Tech stack: C/C++, C++11, STL, the quasi-standard library Boost, Jsoncpp, cppjieba, cpp-httplib; optional: HTML5, CSS, JS, jQuery, Ajax

Project environment: CentOS 7 cloud server, vim/gcc(g++)/Makefile, VS2019 or VS Code

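4. Forward index vs inverted index - how the search engine works in detail

Document 1: 雷军买了四斤小米 (Lei Jun bought four jin of xiaomi, i.e. millet)

Document 2: 雷军发布了小米手机 (Lei Jun released the Xiaomi phone)

Forward index: maps a document ID to the document's content (the keywords in the document).

Segment the target documents into words (purpose: make it easy to build the inverted index and to look things up):

Document 1 [雷军买了四斤小米]: 雷军 (Lei Jun) / 买 (bought) / 四斤 (four jin) / 小米 (Xiaomi) / 四斤小米 (four jin of Xiaomi)
Document 2 [雷军发布了小米手机]: 雷军 (Lei Jun) / 发布 (released) / 小米 (Xiaomi) / 小米手机 (Xiaomi phone)

Stop words: 了, 的, 吗, a, the. These can generally be ignored during word segmentation.

Inverted index: segment the document content, collect the distinct keywords, and map each keyword to the IDs of the documents that contain it.

Simulating one lookup:

User input: 小米 (Xiaomi) -> look it up in the inverted index -> extract the document IDs (1, 2) -> use the forward index -> find the document content -> summarize each result as title + content (desc) + url -> build the response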
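The lookup walk-through above can be sketched in a few lines of C++. This is only an illustration under stated assumptions: `ToyIndex`, `AddDocument`, and `Search` are hypothetical names that do not appear in the project, the words are passed in already segmented (the project itself uses cppjieba for that), and document IDs start at 1 to match the example.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Toy index (hypothetical): forward[id - 1] holds the text of document `id`.
struct ToyIndex {
    std::vector<std::string> forward;                                // forward index
    std::unordered_map<std::string, std::vector<uint64_t>> inverted; // word -> doc IDs

    // Adds one document whose words are already segmented; returns its 1-based ID.
    uint64_t AddDocument(const std::string &text,
                         const std::vector<std::string> &words) {
        forward.push_back(text);
        const uint64_t doc_id = forward.size();
        for (const auto &w : words) {
            std::vector<uint64_t> &ids = inverted[w];
            // A word repeated inside one document is recorded only once.
            if (ids.empty() || ids.back() != doc_id) ids.push_back(doc_id);
        }
        return doc_id;
    }

    // Inverted lookup (word -> IDs) followed by forward lookup (ID -> text).
    std::vector<std::string> Search(const std::string &word) const {
        std::vector<std::string> hits;
        auto it = inverted.find(word);
        if (it == inverted.end()) return hits;
        for (uint64_t id : it->second) hits.push_back(forward[id - 1]);
        return hits;
    }
};
```

Adding the two example documents and calling `Search("Xiaomi")` would return both texts, while a word that occurs in only one document returns just that one.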

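5. Writing the tag-stripping and data-cleaning module: Parser

Boost website: https://www.boost.org/

// For now we only need the HTML files under boost_1_78_0/doc/html; the index is built from them

Stripping tags

[whb@VM-0-3-centos boost_searcher]$ touch parser.cc
//Raw data -> data after tag stripping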
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html> <!-- this is a tag -->
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Chapter 30. Boost.Process</title>
<link rel="stylesheet" href="../../doc/src/boostbook.css" type="text/css">
<meta name="generator" content="DocBook XSL Stylesheets V1.79.1">
<link rel="home" href="index.html" title="The Boost C++ Libraries BoostBook Documentation
Subset">
<link rel="up" href="libraries.html" title="Part I. The Boost C++ Libraries (BoostBook
Subset)">
<link rel="prev" href="poly_collection/acknowledgments.html" title="Acknowledgments">
<link rel="next" href="boost_process/concepts.html" title="Concepts">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<table cellpadding="2" width="100%"><tr>
<td valign="top"><img alt="Boost C++ Libraries" width="277" height="86"
src="../../boost.png"></td>
<td align="center"><a href="../../index.html">Home</a></td>
<td align="center"><a href="../../libs/libraries.htm">Libraries</a></td>
<td align="center"><a href="http://www.boost.org/users/people.html">People</a></td>
<td align="center"><a href="http://www.boost.org/users/faq.html">FAQ</a></td>
<td align="center"><a href="../../more/index.htm">More</a></td>
</tr></table>
.........
// <> : HTML tags. They have no value for searching and must be removed. Tags generally come in pairs!
[whb@VM-0-3-centos data]$ mkdir raw_html
[whb@VM-0-3-centos data]$ ll
total 20
drwxrwxr-x 60 whb whb 16384 Mar 24 16:49 input //the raw HTML documents go here
drwxrwxr-x 2 whb whb 4096 Mar 24 16:56 raw_html //the clean, tag-stripped documents go here
[whb@VM-0-3-centos input]$ ls -Rl | grep -E '*.html' | wc -l
8141
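Goal: strip the tags from every document and write them all into one file. A document's content must not contain any \n. Documents are separated from each other by \3.
version1:
Like: XXXXXXXXXXXXXXXXX\3YYYYYYYYYYYYYYYYYYYYY\3ZZZZZZZZZZZZZZZZZZZZZZZZZ\3
We use the following scheme instead:
version2: when writing the file, make sure the next read is also easy to do!
Like: title\3content\3url \n title\3content\3url \n title\3content\3url \n ...
This lets us call getline(ifstream, line) and get one whole document at a time: title\3content\3url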
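To make the version2 layout concrete, here is a minimal sketch of writing records in that format and reading them back one line per document. `Doc`, `WriteDocs`, and `ReadLines` are hypothetical names used only for this illustration, and the sketch assumes each field has already had its newlines removed, as the goal above requires.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for this sketch.
struct Doc {
    std::string title;
    std::string content;
    std::string url;
};

const char kFieldSep = '\3'; // separator between fields inside one line

// Writes one line per document: title\3content\3url\n
// Assumes the fields themselves contain neither '\n' nor '\3'.
bool WriteDocs(const std::vector<Doc> &docs, std::ostream &out) {
    for (const Doc &d : docs) {
        out << d.title << kFieldSep << d.content << kFieldSep << d.url << '\n';
    }
    return static_cast<bool>(out);
}

// Reads the records back; each std::getline call yields exactly one document.
std::vector<std::string> ReadLines(std::istream &in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) lines.push_back(line);
    return lines;
}
```

Because both functions take generic streams, the same code works with `std::ofstream`/`std::ifstream` on `data/raw_html/raw.txt` or with string streams in a quick check.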

Writing the parser

//Basic structure of the code:
#include <iostream>
#include <string>
#include <vector>
//A directory that holds all the HTML pages
const std::string src_path = "data/input/";
const std::string output = "data/raw_html/raw.txt";
typedef struct DocInfo{
    std::string title;   //document title
    std::string content; //document content
    std::string url;     //URL of this document on the official site
}DocInfo_t;
//const &: input
//*: output
//&: input and output
bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list);
bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results);
bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output);
int main()
{
    std::vector<std::string> files_list;
    //Step 1: recursively collect every HTML file name, with its path, into files_list so the files can be read one by one later
    if(!EnumFile(src_path, &files_list)){
        std::cerr << "enum file name error!" << std::endl;
        return 1;
    }
    //Step 2: read each file listed in files_list and parse it
    std::vector<DocInfo_t> results;
    if(!ParseHtml(files_list, &results)){
        std::cerr << "parse html error" << std::endl;
        return 2;
    }
    //Step 3: write the parsed contents to output, using \3 as the separator
    if(!SaveHtml(results, output)){
        std::cerr << "save html error" << std::endl;
        return 3;
    }
    return 0;
}
//Stub: to be filled in
bool EnumFile(const std::string &src_path, std::vector<std::string> *files_list)
{
    return true;
}
//Stub: to be filled in
bool ParseHtml(const std::vector<std::string> &files_list, std::vector<DocInfo_t> *results)
{
    return true;
}
//Stub: to be filled in
bool SaveHtml(const std::vector<DocInfo_t> &results, const std::string &output)
{
    return true;
}
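The skeleton leaves ParseHtml empty. As one possible way to fill in its per-file work, the sketch below extracts the title and strips the tags with a two-state scan. The helper names `ParseTitle` and `ParseContent` are assumptions made for this illustration, not functions shown in the article, and the scan is deliberately simple: it treats every `<` as the start of a tag and every `>` as the end of one, so a `>` inside an attribute value would confuse it.

```cpp
#include <cstddef>
#include <string>

// Extracts the text between <title> and </title>.
// Returns false if either tag is missing.
bool ParseTitle(const std::string &html, std::string *title) {
    const std::string open = "<title>";
    const std::string close = "</title>";
    std::size_t begin = html.find(open);
    if (begin == std::string::npos) return false;
    begin += open.size();
    const std::size_t end = html.find(close, begin);
    if (end == std::string::npos) return false;
    *title = html.substr(begin, end - begin);
    return true;
}

// Appends the text outside of <...> to *content and replaces each '\n'
// with a space, so that '\n' stays free to act as the record separator.
bool ParseContent(const std::string &html, std::string *content) {
    enum State { kText, kTag };
    State state = kText;
    for (char c : html) {
        switch (state) {
        case kText:
            if (c == '<') state = kTag;                    // a tag begins
            else content->push_back(c == '\n' ? ' ' : c);  // keep visible text
            break;
        case kTag:
            if (c == '>') state = kText;                   // the tag ends
            break;
        }
    }
    return true;
}
```

Note that the title text also shows up in the stripped content, since it sits between two tags like any other visible text.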

Installing the Boost development libraries

$ sudo yum install -y boost-devel

6. Writing the index-building module: Index

//index.hpp basic structure
#pragma once
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>
#include <unordered_map>
namespace ns_index{
    struct DocInfo{
        std::string title;   //document title
        std::string content; //document content after tag stripping
        std::string url;     //URL of the document on the official site
        uint64_t doc_id;     //document ID; no need to dig into it for now
    };
    struct InvertedElem{
        uint64_t doc_id;
        std::string word;
        int weight;
    };
    //Inverted list (posting list)
    typedef std::vector<InvertedElem> InvertedList;
    class Index{
    private:
        //The forward index is stored in an array; the array subscript is naturally the document ID
        std::vector<DocInfo> forward_index; //forward index
        //The inverted index maps one keyword to one or more InvertedElem [keyword -> inverted list]
        std::unordered_map<std::string, InvertedList> inverted_index;
    public:
        Index(){}
        ~Index(){}
    public:
        //Find the document content by doc_id
        DocInfo *GetForwardIndex(uint64_t doc_id)
        {
            return nullptr;
        }
        //Get the inverted list for a keyword
        InvertedList *GetInvertedList(const std::string &word)
        {
            return nullptr;
        }
        //Build the forward and inverted indexes from the tag-stripped, formatted documents
        //data/raw_html/raw.txt
        bool BuildIndex(const std::string &input) //receives the data produced by the parser
        {
            return true;
        }
    };
}
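The two getters in the skeleton return nullptr for now. A minimal sketch of how they might be filled in follows. It copies the data structures from index.hpp into a stand-alone class named `LookupSketch` (a hypothetical name, with public members only so the example is easy to exercise) and assumes a missing document or word should be reported as nullptr.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct DocInfo {
    std::string title;
    std::string content;
    std::string url;
    uint64_t doc_id;
};
struct InvertedElem {
    uint64_t doc_id;
    std::string word;
    int weight;
};
typedef std::vector<InvertedElem> InvertedList;

// Stand-alone illustration; the real Index class keeps these members private.
class LookupSketch {
public:
    std::vector<DocInfo> forward_index;
    std::unordered_map<std::string, InvertedList> inverted_index;

    // doc_id is the vector subscript, so a bounds check is all that is needed.
    DocInfo *GetForwardIndex(uint64_t doc_id) {
        if (doc_id >= forward_index.size()) return nullptr;
        return &forward_index[doc_id];
    }

    // find() avoids inserting an empty list for unknown words,
    // which operator[] would do.
    InvertedList *GetInvertedList(const std::string &word) {
        auto it = inverted_index.find(word);
        if (it == inverted_index.end()) return nullptr;
        return &it->second;
    }
};
```

Returning pointers matches the signatures in the skeleton; callers need to check for nullptr before dereferencing.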

Basic code for building the forward index

DocInfo *BuildForwardIndex(const std::string &line)
{
    //1. Parse line by splitting the string
    //line -> 3 strings: title, content, url
    std::vector<std::string> results;
    const std::string sep = "\3"; //separator inside a line
    ns_util::StringUtil::CutString(line, &results, sep);
    if(results.size() != 3){
        return nullptr;
    }
    //2. Fill a DocInfo with the strings
    DocInfo doc;
    doc.title = results[0];   //title
    doc.content = results[1]; //content
    doc.url = results[2];     //url
    //Save the id first, then insert: the id is the subscript this doc will have in the vector!
    doc.doc_id = forward_index.size();
    //3. Insert into the forward index vector
    forward_index.push_back(std::move(doc)); //doc holds the contents of one HTML file
    return &forward_index.back();
}
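BuildForwardIndex relies on ns_util::StringUtil::CutString, whose body is not shown in the article. The sketch below is a stand-in written with only the standard library, under the assumption that the helper splits `target` on every occurrence of `sep` and keeps empty fields. It is named `CutStringSketch` to make clear it is not the project's actual helper, which may behave differently (for example by merging adjacent separators).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits `target` on each occurrence of `sep` and appends the pieces to *out.
// Empty fields are kept, so "a\3\3b" yields three pieces.
void CutStringSketch(const std::string &target,
                     std::vector<std::string> *out,
                     const std::string &sep) {
    if (sep.empty()) { // nothing to split on: return the input unchanged
        out->push_back(target);
        return;
    }
    std::size_t start = 0;
    while (true) {
        const std::size_t pos = target.find(sep, start);
        if (pos == std::string::npos) {
            out->push_back(target.substr(start)); // last piece
            break;
        }
        out->push_back(target.substr(start, pos - start));
        start = pos + sep.size(); // continue after the separator
    }
}
```

Keeping empty fields means a line with a missing title still splits into three pieces, so the `results.size() != 3` check only rejects lines that have the wrong number of separators.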

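Building the inverted index

//How it works:
struct InvertedElem{
    uint64_t doc_id;
    std::string word;
    int weight;
};
//Inverted list (posting list)
typedef std::vector<InvertedElem> InvertedList;
//The inverted index maps one keyword to one or more InvertedElem [keyword -> inverted list]
std::unordered_map<std::string, InvertedList> inverted_index;
//The document content we are given
struct DocInfo{
    std::string title;   //document title
    std::string content; //document content after tag stripping
    std::string url;     //URL of the document on the official site
    uint64_t doc_id;     //document ID; no need to dig into it for now
};
//Document: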
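The article stops before showing how one DocInfo is turned into InvertedElem entries, so the following is only a sketch under loudly stated assumptions: words are split on whitespace as a stand-in for cppjieba, and the weight formula (10 per occurrence in the title, 1 per occurrence in the content) is an illustrative choice, not something the source specifies. `SplitWords`, `WordCount`, and `BuildInvertedIndexSketch` are hypothetical names.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

struct DocInfo {
    std::string title;
    std::string content;
    std::string url;
    uint64_t doc_id;
};
struct InvertedElem {
    uint64_t doc_id;
    std::string word;
    int weight;
};
typedef std::vector<InvertedElem> InvertedList;

// Per-word occurrence counts inside one document.
struct WordCount {
    int title_cnt;
    int content_cnt;
};

// Stand-in for a real segmenter: splits on whitespace only.
std::vector<std::string> SplitWords(const std::string &text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) words.push_back(w);
    return words;
}

// Counts each word in the title and content, then appends one InvertedElem
// per distinct word to that word's inverted list.
void BuildInvertedIndexSketch(
        const DocInfo &doc,
        std::unordered_map<std::string, InvertedList> *inverted_index) {
    std::unordered_map<std::string, WordCount> counts; // counts start at zero
    for (const auto &w : SplitWords(doc.title)) counts[w].title_cnt++;
    for (const auto &w : SplitWords(doc.content)) counts[w].content_cnt++;

    for (const auto &kv : counts) {
        InvertedElem elem;
        elem.doc_id = doc.doc_id;
        elem.word = kv.first;
        // Assumed weighting: a title hit counts ten times a content hit.
        elem.weight = 10 * kv.second.title_cnt + kv.second.content_cnt;
        (*inverted_index)[kv.first].push_back(elem);
    }
}
```

Calling this once per document, right after the forward entry is created, fills `inverted_index` so that each keyword points at every document containing it together with a relevance weight.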

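7. Writing the search module: Searcher

#include "index.hpp"
namespace ns_searcher{
    class Searcher{
    private:
        ns_index::Index *index; //the index the system searches in
    public:
        Searcher(){}
        ~Searcher(){}
    public:
        void InitSearcher(const std::string &input)
        {
            //1. Get or create the index object
            //2. Build the index through the index object
        }
        //query: the search keywords
        //json_string: the search results returned to the user's browser
        void Search(const std::string &query, std::string *json_string)
        {
            //1. [Segment]: split the query into words the way the searcher requires
            //2. [Trigger]: look up each resulting word in the index
            //3. [Merge and sort]: combine the lookup results and sort by relevance (weight), descending
            //4. [Build]: build a JSON string from the results -- jsoncpp
        }
    };
}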
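Step 3 of Search (merge and sort) can be sketched without any third-party library. The code below assumes the inverted lists for the query words have already been fetched, merges entries that refer to the same document by summing their weights (an assumption; the skeleton only says results are sorted by weight), and sorts by descending weight. `Hit` and `MergeAndRank` are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct InvertedElem {
    uint64_t doc_id;
    std::string word;
    int weight;
};
typedef std::vector<InvertedElem> InvertedList;

// One merged search hit: a document plus the query words that matched it.
struct Hit {
    uint64_t doc_id = 0;
    int weight = 0;
    std::vector<std::string> words;
};

// Merges the inverted lists of all query words so each document appears once,
// then sorts by total weight, highest first (ties broken by smaller doc_id).
std::vector<Hit> MergeAndRank(const std::vector<InvertedList> &lists) {
    std::unordered_map<uint64_t, Hit> merged;
    for (const auto &list : lists) {
        for (const auto &elem : list) {
            Hit &hit = merged[elem.doc_id];
            hit.doc_id = elem.doc_id;
            hit.weight += elem.weight;
            hit.words.push_back(elem.word);
        }
    }
    std::vector<Hit> hits;
    hits.reserve(merged.size());
    for (auto &kv : merged) hits.push_back(std::move(kv.second));
    std::sort(hits.begin(), hits.end(), [](const Hit &a, const Hit &b) {
        if (a.weight != b.weight) return a.weight > b.weight;
        return a.doc_id < b.doc_id;
    });
    return hits;
}
```

Merging before sorting keeps a document that matches several query words from appearing several times in the response.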

8. Writing the http_server module

cpp-httplib library: https://gitee.com/zhangkt1995/cpp-httplib?_from=gitee_search

#include "cpp-httplib/httplib.h"
#include "searcher.hpp"
const std::string input = "data/raw_html/raw.txt";
const std::string root_path = "./wwwroot";
int main()
{
    ns_searcher::Searcher search;
    search.InitSearcher(input);
    httplib::Server svr;
    svr.set_base_dir(root_path.c_str());
    svr.Get("/s", [&search](const httplib::Request &req, httplib::Response &rsp){
        if(!req.has_param("word")){
            rsp.set_content("A search keyword is required!", "text/plain; charset=utf-8");
            return;
        }
        std::string word = req.get_param_value("word");
        std::cout << "User is searching for: " << word << std::endl;
        std::string json_string;
        search.Search(word, &json_string);
        rsp.set_content(json_string, "application/json");
        //rsp.set_content("Hello, world!", "text/plain; charset=utf-8");
    });
    svr.listen("0.0.0.0", 8081);
    return 0;
}

9. Writing the front-end module
