Innovation Training (5) RAG: Knowledge Base Migration, Reconfiguration, and Data Vectorization

plalap

Last modified: 2024-06-24 00:08:56


First published: 2024-05-25 15:47:15

Article link: https://blog.csdn.net/qq_26245471/article/details/139198408


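The project is moving to a new server, so the knowledge base has to be migrated.

Setting up the Python environment

The new server has no Python environment, so install Python 3.11.8 from the "Python Release Python 3.11.8" page on Python.org.

Once the installer finishes, Python is ready to use.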

Setting up Git

Download Git from the official site, https://git-scm.com/. The version used here is 2.45.1.

Run the installer. When it finishes, Git is available from the command line.

Installing VS Code

Go to the official website and click Download. In the installer, change the installation path.

Creating a virtual environment

Use Python's built-in venv module to create a virtual environment:

python -m venv zhouyi_venv

Activate it:

 .\zhouyi_venv\Scripts\activate

To leave the environment, run: deactivate
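As a quick sanity check before installing packages, the following sketch (not part of the original setup) reports the interpreter version and whether the current process runs inside a virtual environment. The helper name in_virtualenv is hypothetical; comparing sys.prefix with sys.base_prefix is the check described in the venv documentation.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    version = ".".join(str(part) for part in sys.version_info[:3])
    print(f"Python {version}, virtual environment active: {in_virtualenv()}")
```

Run it once before and once after activating zhouyi_venv; the reported flag should change from False to True.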

Cloning ZhouYiLLM

git clone https://github.com/Liyanhao1209/ZhouYiLLM.git

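The clone failed with an error, and a retry reported that there was no network connection.

Recalling the earlier setup, Clash is the proxy client and its proxy port is 7890. The system proxy settings looked fine, so the likely cause was that Git itself was not going through the proxy.

The fix is to edit the .gitconfig file in the user's home directory on the C: drive. Note that sslVerify = false turns off TLS certificate verification, so it is best treated as a temporary workaround. Git documents these keys under http.*, which also covers https:// URLs; the [https] section is reproduced here as it was in the original setup.

The modified file: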

[http]
    sslVerify = false
    postBuffer = 524288000
    proxy = 127.0.0.1:7890
[https]
    sslVerify = false
    proxy = 127.0.0.1:7890
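Before blaming Git, it can help to confirm that something is actually listening on the proxy port. The sketch below is an illustration only, not part of the original troubleshooting: the function name proxy_is_listening is hypothetical, and the defaults simply mirror the 127.0.0.1:7890 address from the configuration above. It only checks that a TCP connection can be opened; it does not verify that the listener is a working proxy.

```python
import socket


def proxy_is_listening(host: str = "127.0.0.1", port: int = 7890,
                       timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (refused, timed out, ...) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("proxy reachable:", proxy_is_listening())
```

If this prints False, the proxy client is not running or uses a different port, and changing .gitconfig alone will not help.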

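Cloning again now succeeds. The python_scripts and java_scripts branches are cloned with the -b option. Run the two commands from different parent directories, because both default to a folder named ZhouYiLLM: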

git clone -b python_scripts https://github.com/Liyanhao1209/ZhouYiLLM.git

git clone -b java_scripts https://github.com/Liyanhao1209/ZhouYiLLM.git

Installing Langchain-Chatchat

Follow the project documentation for the installation.

1. Environment setup

First, clone the repository:

git clone https://github.com/chatchat-space/Langchain-Chatchat.git

Then activate the virtual environment; its activation script is D:\env\zhouyi_venv\Scripts\activate.

# Enter the project directory
cd Langchain-Chatchat

pip install -r requirements.txt 
pip install -r requirements_api.txt
pip install -r requirements_webui.txt

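The torch installed from requirements.txt is the CPU build, so the GPU build of PyTorch has to be installed manually.

First, check which CUDA version the driver supports: open the NVIDIA Control Panel, click System Information, and look at the Components tab. Here it is CUDA 12.2.

Then go to the PyTorch website, find the matching build, and select the options that fit the machine:

OS: Windows
Package manager: Pip
Language: Python
CUDA: 12.2

The command uses the cu121 (CUDA 12.1) wheels. A driver that reports CUDA 12.2 can run them, because the version shown by the driver is the highest CUDA runtime it supports. The resulting command is: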

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
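After the install, a short check like the one below can confirm which build ended up in the environment. This is a sketch rather than part of the original steps; the function name describe_torch is hypothetical, while torch.__version__, torch.version.cuda, torch.cuda.is_available() and torch.cuda.get_device_name() are standard PyTorch attributes.

```python
def describe_torch() -> str:
    """Summarize the installed PyTorch build, or say that it is missing."""
    try:
        import torch  # imported lazily so the script still runs without PyTorch
    except ImportError:
        return "torch is not installed in this environment"

    # torch.version.cuda is None for CPU-only builds.
    summary = f"torch {torch.__version__}, built for CUDA {torch.version.cuda}"
    if torch.cuda.is_available():
        return f"{summary}, GPU: {torch.cuda.get_device_name(0)}"
    return f"{summary}, no usable GPU detected"


if __name__ == "__main__":
    print(describe_torch())
```

A CPU-only build reports "built for CUDA None", which is the situation the manual reinstall is meant to fix.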

2. Downloading the models

$ git lfs install
$ git clone https://huggingface.co/THUDM/chatglm3-6b
$ git clone https://huggingface.co/BAAI/bge-large-zh

Both models finished downloading and then failed with the same error:

fatal: active `post-checkout` hook found during `git clone`:
        D:/longchain/Langchain-Chatchat/chatglm3-6b/.git/hooks/post-checkout
For security reasons, this is disallowed by default.
If this is intentional and the hook should actually be run, please
run the command again with `GIT_CLONE_PROTECTION_ACTIVE=false`
warning: Clone succeeded, but checkout failed.
You can inspect what was checked out with 'git status'
and retry with 'git restore --source=HEAD :/'

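The cause is that the cloned repository ends up with an active post-checkout hook script (probably the one Git LFS installs), and by default Git refuses to run hooks during a clone for security reasons.

Solution: set the environment variable named in the message, then clone again: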

SET GIT_CLONE_PROTECTION_ACTIVE=false
git clone https://huggingface.co/THUDM/chatglm3-6b
git clone https://huggingface.co/BAAI/bge-large-zh
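The SET command above changes the variable for the whole console session. A narrower alternative, sketched here as an assumption rather than the method used in this post, is to pass the variable only to the git process. The helper names clone_env and clone are hypothetical; subprocess.run with the env argument is standard library behavior.

```python
import os
import subprocess


def clone_env() -> dict:
    """Copy the current environment and disable Git's clone-time hook protection."""
    env = dict(os.environ)  # a copy, so the parent process stays unchanged
    env["GIT_CLONE_PROTECTION_ACTIVE"] = "false"
    return env


def clone(url: str) -> None:
    """Run `git clone url` with the protection disabled for this one process."""
    # check=True raises CalledProcessError if git exits with a non-zero status.
    subprocess.run(["git", "clone", url], env=clone_env(), check=True)
```

Calling clone("https://huggingface.co/THUDM/chatglm3-6b") would then start the download without leaving the protection disabled for later commands. Only do this for repositories you trust, since it allows their hooks to run.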

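This time both downloads succeed.

3. Initializing the knowledge base and configuration files

Copy the example configuration files and initialize the knowledge base as follows: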

python copy_config_example.py
python init_database.py --recreate-vs

Both scripts run to completion.

4. One-command startup

Start the project with the following command:

python startup.py -a

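Startup failed with an error, even though pip list showed the package in question as installed.

The next step was to uninstall PyTorch and install it again: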

pip uninstall torch torchvision torchaudio
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

After that, pip show torch reports the correct GPU-enabled version, and the service starts normally.

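Tokenization and document splitting

The goal is to process unstructured text: tokenize the articles of the journal 《周易研究》 (Zhouyi Studies), vectorize them, build an index, and store the result in the knowledge base.

When a file is uploaded, Langchain-Chatchat tokenizes and vectorizes the document automatically according to the upload parameters. So only the raw text needs to be prepared locally; tokenization and vectorization happen during the upload.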
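To make the splitting step concrete, here is a minimal character-based splitter with overlap. It is a simplified illustration and not the splitter Langchain-Chatchat actually uses; the function name split_text and the default sizes of 250 and 50 characters are assumptions chosen for the example.

```python
def split_text(text: str, chunk_size: int = 250, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, so text near a chunk
    boundary keeps some of its surrounding context.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in split_text("abcdefghij", chunk_size=4, overlap=1):
        print(chunk)
```

Each chunk is what would later be turned into one vector, so the chunk size trades retrieval precision against how much context each hit carries.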

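Uploading the files

There are many files to upload, in a variety of formats, so for convenience they are sent through the API. The upload code is already written; only the configuration file needs to be changed.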
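For readers without the upload script at hand, the sketch below shows roughly what a single API upload could look like. Everything specific in it is an assumption: the helper names build_form_fields and upload_file are hypothetical, the form field names (knowledge_base_name, override, to_vector_store) and the multipart field "files" are guesses at the server's upload endpoint that should be checked against the running server's API documentation, and the third-party requests package must be installed.

```python
from pathlib import Path


def build_form_fields(kb_name: str, override: bool = False) -> dict:
    """Form fields sent with each file (field names are assumptions)."""
    return {
        "knowledge_base_name": kb_name,
        "override": str(override).lower(),
        "to_vector_store": "true",
    }


def upload_file(path: str, api_url: str, kb_name: str) -> dict:
    """POST one file as multipart/form-data and return the decoded JSON reply."""
    import requests  # third-party; imported here so the helper above works without it

    file_path = Path(path)
    with file_path.open("rb") as fh:
        response = requests.post(
            api_url,
            data=build_form_fields(kb_name),
            files=[("files", (file_path.name, fh))],
            timeout=600,  # vectorizing a large document can take a while
        )
    response.raise_for_status()
    return response.json()
```

The api_url argument is left without a default on purpose, since the host, port, and route depend on how the service was started.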

First, clean some unneeded data out of the source files.

The original script only reads files directly inside each source directory and cannot descend into nested folders, so the loop is changed as follows:

        # Walk each source directory and all of its subdirectories. os.walk() returns a
        # generator that yields, for every directory, its path (dirpath), its subdirectory
        # names (dirnames), and its file names (filenames).
        for src_path in src_paths:
            for dirpath, dirnames, filenames in os.walk(src_path):
                for fn in filenames:
                    # Build the full path and pass it to request_api() for upload.
                    path = os.path.join(dirpath, fn)
                    print(path)
                    request_api(path, api_path)
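The nested traversal can also be pulled out into a small generator, which makes it easy to test without uploading anything. This is a sketch of one possible refactoring, not the code in the project; the name iter_files is hypothetical, and sorting the file names is an added choice to make the order within each directory predictable.

```python
import os
from typing import Iterable, Iterator


def iter_files(src_paths: Iterable[str]) -> Iterator[str]:
    """Yield the full path of every file under each directory in src_paths."""
    for src_path in src_paths:
        # os.walk visits src_path itself and then every nested subdirectory.
        for dirpath, _dirnames, filenames in os.walk(src_path):
            for fn in sorted(filenames):
                yield os.path.join(dirpath, fn)
```

The upload loop then reduces to iterating over iter_files(src_paths) and calling request_api(path, api_path) for each path.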

Activate the virtual environment first, then run:

python upload-v2.py config/uploadDoc.json

Then just wait for the upload to finish.

Once all the files are uploaded, they appear in the knowledge base.
