Building a knowledge base with Hexo: generating 30,000 documents and rebuilding the full-text search plugin with RediSearch

flyasengineer

Last modified: 2024-04-28 10:14:43

Tags: python, redis, fastapi, node.js, full-text search

First published: 2024-04-19 08:51:09

Article link: https://blog.csdn.net/lanse413/article/details/137948966

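Use case

I originally started using Hexo to organize my study resources: e-books and documents of all kinds, adding up to thirty or forty thousand files.

Problems encountered

To get to the point: the knowledge base is built with Hexo, and many more books and materials will be added later, so the first big problem is generation speed. On Windows, generation runs out of memory and fails with a "too many open files" error. I found no good fix there, so I abandoned Windows and switched to generating on Ubuntu. The same problems appear there, but they are quick to deal with. The second problem is Hexo's full-text search. Starting with the first: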

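Document generation

Below are the fixes for the two problems that come up when generating a very large number of documents:

Out of memory

Add an environment variable, then run source to apply it. I set it to roughly 28 GB (the value is given in MB) on a machine with 32 GB of RAM. Adjust it to your own setup.

- Edit `~/.bashrc` (`vi ~/.bashrc`), add the line below, then run `source ~/.bashrc`:
`export NODE_OPTIONS="--max-old-space-size=28192"`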

Too many open files

First, here is the solution given in the official documentation and around the web:

$ ulimit -n 10000
ulimit: open files: cannot modify limit: Operation not permitted
This means some system-wide configuration is preventing ulimit from being increased to that limit.

To override the limit:

Add the following line to “/etc/security/limits.conf”:
* - nofile 10000
# '*' applies to all users and '-' sets both soft and hard limits
The above setting may not apply in some cases, ensure “/etc/pam.d/login” and “/etc/pam.d/lightdm” have the following line. (Ignore this step if those files do not exist)
session required pam_limits.so
If you are on a systemd-based distribution, systemd may override “limits.conf”. To set the limit in systemd, add the following line in “/etc/systemd/system.conf” and “/etc/systemd/user.conf”:
DefaultLimitNOFILE=10000
Reboot
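
Before editing system files, it can help to see which limits are actually in effect. The following is a minimal sketch using Python's standard `resource` module (Unix only). The function names are my own, and the change applies only to the Python process itself and its children, so treat it as a diagnostic rather than a fix for the Hexo shell.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Raise the soft limit toward `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = nofile_limits()
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # an unprivileged process cannot exceed the hard limit
    if soft == resource.RLIM_INFINITY or target <= soft:
        return soft  # already high enough, nothing to change
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


if __name__ == "__main__":
    print("before:", nofile_limits())
    print("soft limit now:", raise_soft_limit(10000))
```

If the hard limit reported here is below the value you need, that is the case where the `limits.conf` or systemd settings above come into play.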

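In practice, it is enough to limit how many files are processed at the same time, which is what the -c (concurrency) option does: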

Error: EMFILE, too many open files
hexo g -c 100

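The whole generation process is fairly slow. The first run took about an hour, and later runs after adding new documents take about twenty minutes. Raising the -c value might speed this up considerably, depending on how much memory the machine has. With -c 100, memory usage reached about 7 GB.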
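
Why a concurrency cap avoids EMFILE can be pictured with a small sketch. This is not Hexo's implementation, only an illustration of the general idea in Python: a semaphore lets at most `concurrency` tasks hold a "file" at once, so the number of open descriptors stays bounded no matter how many documents are queued. `render_all` and the file names are hypothetical.

```python
import asyncio


async def render_all(paths, concurrency):
    """Process every path with at most `concurrency` tasks in flight at once."""
    gate = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def render(path):
        nonlocal in_flight, peak
        async with gate:  # wait here while `concurrency` tasks are already running
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for opening and rendering one file
            in_flight -= 1
        return path

    done = await asyncio.gather(*(render(p) for p in paths))
    return done, peak


if __name__ == "__main__":
    files = [f"post-{i}.md" for i in range(1000)]
    done, peak = asyncio.run(render_all(files, 100))
    print(len(done), "files processed, peak in flight:", peak)
```

A larger cap means more work in parallel but also more files and buffers held at once, which matches the trade-off between speed and memory described above.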

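Reworking full-text search

First, the end result.

Advantages: the page does not need to load an index file on entry, so no loading bar appears; search terms can be combined with logical operators; it is lightweight, uses very few system resources, and is fast.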

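Hexo local search

The local search plugin generates a search.xml file and searches within it. For a use case like mine, though, that file reaches 200 MB, and even with gzip compression it is still 70 to 80 MB.

Every visitor would have to download that file, and with bandwidth or CDN traffic as expensive as it is, the cost would be considerable. Growing the site further would be unthinkable, so search has to move to an API-based approach.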

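Third-party search

hexo algolia is free only within certain limits, which cover both document size and request count. With the number of documents I need to submit, the cost would be substantial. The service is also hosted overseas and is extremely slow to reach from here, never mind uploading documents. So I decided to write my own, which is actually fairly simple because free full-text search engines exist. The initial requirements were that it be free and lightweight.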

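Building it myself

The chosen architecture is as follows:

The server side uses RediSearch and FastAPI. On the Hexo side, the main change is to the local-search JS library.

Changes to local-search

Starting at around line 24, comment out the original code and change it as follows, so that results are fetched from the API:

The source file is located under: /themes/next/source/js/third-party/search/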

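      // Perform local searching
      // resultItems = localSearch.getResultItems(keywords);

      // Query the search API instead of the local search.xml index.
      // Note: `await` requires the enclosing handler to be declared `async`.
      const resultItems_rs = [];
      const hitCount = 1;
      const includedCount = 22;
      const apiurl = 'http://127.0.0.1:8000/wiki/';
      console.log('Search term: ' + searchText);
      try {
        // Encode the search text so spaces and special characters survive in the URL
        const response = await fetch(apiurl + encodeURIComponent(searchText));
        const jsondata = await response.json();
        if (jsondata.code == 200) {
          jsondata.data.forEach(post => {
            const item_url = post.url;
            const title = post.title;
            const content = post.content + '...';
            let resultItem = '<li><a href="' + item_url + '" class="search-result-title">' + title + '</a>';
            resultItem += '<a href="' + item_url + '?highlight=' + encodeURIComponent(searchText) + '"><p class="search-result">' + content + '</p></a>';
            resultItem += '</li>';
            resultItems_rs.push({
              item: resultItem,
              id  : resultItems_rs.length, // index within the API results
              hitCount,
              includedCount
            });
          });
        } else {
          // Field name kept as spelled in the original code
          console.log(jsondata.messege);
        }
      } catch (error) {
        // Network or parse failure: show a single placeholder result
        resultItems_rs.push({
          item: '<li>The search service is not responding. Please contact the administrator or send an email.</li>',
          id  : 0,
          hitCount,
          includedCount
        });
        // jsondata is not in scope here, so log only the error
        console.log('error: ' + error);
      }

      resultItems = resultItems_rs;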
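
The client code above reads `code`, `messege`, and a `data` list whose entries carry `url`, `title`, and `content`. As a rough sketch of what the server would need to return, the helper below shapes search hits into that structure. The function name, the snippet length, the 404 code for "no results", and the input dict keys are all assumptions of mine, not the author's implementation. Since the client inserts `title` and `content` directly into HTML, the sketch escapes them first.

```python
import html


def build_payload(docs, snippet_len=100):
    """Shape search hits into the JSON structure the patched client reads.

    `docs` is assumed to be a list of dicts with `url`, `title` and `body` keys.
    """
    if not docs:
        # Any non-200 code makes the client log `messege` and show no results.
        return {"code": 404, "messege": "no results", "data": []}
    data = []
    for doc in docs:
        data.append({
            "url": doc["url"],
            # Escape text fields because the client concatenates them into HTML.
            "title": html.escape(doc["title"]),
            # Truncate the body to a short snippet, then escape it.
            "content": html.escape(doc["body"][:snippet_len]),
        })
    return {"code": 200, "messege": "ok", "data": data}


if __name__ == "__main__":
    hits = [{"url": "/books/redis/", "title": "Redis <notes>", "body": "x" * 500}]
    print(build_payload(hits, snippet_len=20))
```

A dict like this can be returned directly from a web framework's route handler and serialized to JSON.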

Server side

Working with RediSearch from Python

Install, create, search

# Install Docker, then pull and run RediSearch (redis-stack-server)
`sudo curl -fsSL https://get.docker.com | bash -s docker --mirror Aliyun`
`docker pull redis/redis-stack-server:latest`
`docker run -itd -p 6666:6379 redis/redis-stack-server:latest`

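# Create the index with the Python library
from redisearch import Client, Query, TextField

class SearchIndex:  # class name is illustrative; the original shows only the methods
    def __init__(self):
        # config.server_ip comes from the author's own config module
        self.client = Client('myidx', host=config.server_ip, port=6666)

    def newindex(self):
        # Drop any existing index; ignore the error if there is none
        try:
            self.client.drop_index()
        except Exception:
            pass
        # Define the index schema; list the fields that should be searchable here
        self.client.create_index([TextField('title', weight=5), TextField('body', weight=1)])

# Add a document (runs inside a method of the same class)
try:
    self.client.add_document(doc_id.lstrip(), body=content, title=title, url=url, language='chinese')
except Exception:
    pass

# Search by title. Note that if paging is not specified, a default value still applies
res_for_title_rs = self.client.search(
    Query(keyword).language('chinese').limit_fields('title').paging(0, page_doc).sort_by('title'))

# Search by body
res_for_body = self.client.search(
    Query(keyword).language('chinese').with_scores().limit_fields('body').paging(0, page_doc))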
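
The two queries above return separate result sets, and the post does not show how they are combined. One plausible approach, sketched here as an assumption, is to list title matches first, then body matches, and skip any document already seen. Plain dicts with an `id` key stand in for the result documents, and `merge_hits` and `limit` are hypothetical names.

```python
def merge_hits(title_hits, body_hits, limit=20):
    """Return title matches first, then body matches, without duplicates."""
    merged = []
    seen = set()
    for hit in list(title_hits) + list(body_hits):
        if hit["id"] in seen:
            continue  # the same document matched in both title and body
        seen.add(hit["id"])
        merged.append(hit)
        if len(merged) == limit:
            break  # stop once enough results are collected
    return merged


if __name__ == "__main__":
    by_title = [{"id": "doc1"}, {"id": "doc2"}]
    by_body = [{"id": "doc2"}, {"id": "doc3"}]
    print([hit["id"] for hit in merge_hits(by_title, by_body)])
```

Putting title hits first mirrors the higher weight given to `title` in the index schema.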

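The whole server side is built with FastAPI. If you would like to see the full source, leave a comment. If there is enough interest, I will consider tidying it up and open-sourcing it for everyone to learn from.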