NetEase Youdao QAnything: Installation and Deployment Walkthrough

1. What is QAnything?

1.1 QAnything

QAnything (Question and Answer based on Anything) is a local knowledge base question-answering system designed to support files of any format as well as databases; it can be installed and used completely offline.

You can drop in local files of any supported format and get accurate, fast, and reliable question answering.

Currently supported formats: PDF (pdf), Word (docx), PPT (pptx), XLS (xlsx), Markdown (md), email (eml), TXT (txt), images (jpg, jpeg, png), CSV (csv), and web links (html). More formats are on the way.

1.2 Features

  • Data security: the system can be installed and used with the network cable unplugged the whole time.
  • Cross-language Q&A: switch freely between Chinese and English questions, regardless of the language of your documents.
  • Q&A over massive data: two-stage retrieval with reranking solves the degradation problem of large-scale retrieval; the more data, the better the results.
  • High-performance, production-grade system that can be deployed directly for enterprise applications.
  • Easy to use: no tedious configuration, one-click installation and deployment, ready to use out of the box.
  • Supports selecting multiple knowledge bases for a single Q&A session.

1.3 Architecture

(Figure: QAnything system architecture)
1.3.1 Why two-stage retrieval?

When the knowledge base is large, the advantage of two-stage retrieval is significant. With first-stage embedding retrieval alone, accuracy degrades as the data volume grows (the green line in the figure below). After the second-stage rerank, accuracy keeps growing steadily instead: the more data, the better the results.

(Figure: two-stage retrieval, accuracy vs. data scale)

BCEmbedding, the retrieval component used by QAnything, has very strong bilingual and cross-language capabilities. It eliminates the gap between Chinese and English in semantic retrieval, which is reflected in the benchmark results below:

1.3.2 First-stage retrieval (embedding)
| Model name | Retrieval | STS | PairClassification | Classification | Reranking | Clustering | Avg. |
|---|---|---|---|---|---|---|---|
| bge-base-en-v1.5 | 37.14 | 55.06 | 75.45 | 59.73 | 43.05 | 37.74 | 47.20 |
| bge-base-zh-v1.5 | 47.60 | 63.72 | 77.40 | 63.38 | 54.85 | 32.56 | 53.60 |
| bge-large-en-v1.5 | 37.15 | 54.09 | 75.00 | 59.24 | 42.68 | 37.32 | 46.82 |
| bge-large-zh-v1.5 | 47.54 | 64.73 | 79.14 | 64.19 | 55.88 | 33.26 | 54.21 |
| jina-embeddings-v2-base-en | 31.58 | 54.28 | 74.84 | 58.42 | 41.16 | 34.67 | 44.29 |
| m3e-base | 46.29 | 63.93 | 71.84 | 64.08 | 52.38 | 37.84 | 53.54 |
| m3e-large | 34.85 | 59.74 | 67.69 | 60.07 | 48.99 | 31.62 | 46.78 |
| bce-embedding-base_v1 | 57.60 | 65.73 | 74.96 | 69.00 | 57.29 | 38.95 | 59.43 |
1.3.3 Second-stage retrieval (rerank)
| Model name | Reranking | Avg. |
|---|---|---|
| bge-reranker-base | 57.78 | 57.78 |
| bge-reranker-large | 59.69 | 59.69 |
| bce-reranker-base_v1 | 60.06 | 60.06 |

2. Getting Started

Try QAnything online

2.1 Prerequisites

2.1.1 For Linux
| System | Required item | Minimum Requirement | Note |
|---|---|---|---|
| Linux amd64 | NVIDIA GPU Memory | >= 4GB (use OpenAI API) | Minimum: GTX 1050Ti (use OpenAI API); Recommended: RTX 3090 |
| | NVIDIA Driver Version | >= 525.105.17 | |
| | Docker version | >= 20.10.5 | Docker install |
| | docker compose version | >= 2.23.3 | docker compose install |
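
A quick way to confirm the driver, Docker, and docker compose versions before installing (plain shell commands; the version thresholds are the ones from the table above):

# Check the NVIDIA driver version (should be >= 525.105.17) and visible GPU memory
nvidia-smi

# Check the Docker engine version (should be >= 20.10.5)
docker --version

# Check the docker compose plugin version (should be >= 2.23.3)
docker compose version
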
2.1.2 For Windows with WSL Ubuntu subsystem
| System | Required item | Minimum Requirement | Note |
|---|---|---|---|
| Windows with WSL Ubuntu subsystem | NVIDIA GPU Memory | >= 4GB (use OpenAI API) | Minimum: GTX 1050Ti (use OpenAI API); Recommended: RTX 3090 |
| | GEFORCE EXPERIENCE | >= 546.33 | GEFORCE EXPERIENCE download |
| | Docker Desktop | >= 4.26.1 (131620) | Docker Desktop for Windows |
| | git-lfs | | git-lfs install |
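
For the WSL setup, a few sanity checks inside the WSL Ubuntu shell can save time; the apt commands are one common way to get git-lfs and assume the Windows NVIDIA driver is already exposed to WSL:

# Inside the WSL Ubuntu shell: confirm the GPU is visible through the Windows driver
nvidia-smi

# Install and enable git-lfs (required for cloning the large model repositories)
sudo apt-get update && sudo apt-get install -y git-lfs
git lfs install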

2.2 Download and Install

2.2.1 Download the project

Download the source code:

git clone https://github.com/netease-youdao/QAnything.git

Get the Embedding models:

git clone https://www.modelscope.cn/netease-youdao/QAnything.git
  • Download the required Embedding models from Youdao's model repository.
  • Unpack the downloaded model files to obtain a folder named "models", which contains the required embedding models.
  • Place the extracted "models" folder in the root directory of QAnything (a minimal sketch follows this list).
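
A minimal sketch of the steps above, assuming the ModelScope repository cloned below contains the "models" folder described in the list (paths are illustrative; adjust them to your environment):

# Clone the model repository into a temporary directory (assumes git-lfs is installed)
cd /tmp
git clone https://www.modelscope.cn/netease-youdao/QAnything.git qanything-models

# Copy the "models" folder into the root of the QAnything project
cp -r /tmp/qanything-models/models /path/to/QAnything/

# Verify the embedding models are in place
ls /path/to/QAnything/models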

Download a large language model:

  • The "Tongyi Qianwen" (Qwen) large language model is recommended.
  • Download the chosen large language model and place it in QAnything's "assets/custom_models/" folder.

MiniChat-2-3B

git clone https://www.modelscope.cn/netease-youdao/MiniChat-2-3B.git

Qwen-7B

git clone https://www.modelscope.cn/netease-youdao/Qwen-7B-QAnything.git
2.2.2 QAnything service startup command usage

Usage:

bash run.sh [-c <llm_api>] [-i <device_id>] [-b <runtime_backend>] [-m <model_name>] [-t <conv_template>] [-p <tensor_parallel>] [-r <gpu_memory_utilization>] [-h]

-c <llm_api>: Specifies the LLM API mode, one of {local, cloud}; default is 'local'. With '-c cloud', first set the environment variables {OPENAI_API_KEY, OPENAI_API_BASE, OPENAI_API_MODEL_NAME, OPENAI_API_CONTEXT_LENGTH} in the .env file by hand (see the sketch after this option list).
-i <device_id>: Specifies the GPU device ID(s).
-b <runtime_backend>: Specifies the LLM inference runtime backend, one of {default, hf, vllm}.
-m <model_name>: Specifies the name of the public LLM model to load and serve through the FastChat serve API, e.g. one of {Qwen-7B-Chat, deepseek-llm-7b-chat, ...}.
-t <conv_template>: Specifies the conversation template to use with the public LLM model, e.g. one of {qwen-7b-chat, deepseek-chat, ...}.
-p <tensor_parallel>: Sets the tensor parallel size for the vllm backend, one of {1, 2}; default is 1.
-r <gpu_memory_utilization>: Specifies the gpu_memory_utilization parameter of the vllm backend, in (0, 1]; default is 0.81.
-h: Shows the help message.
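
For '-c cloud', a minimal sketch of the .env entries mentioned above; the variable names come from the option description, while the values shown are placeholders you must replace with your own key, endpoint, model name, and context length:

# Append the OpenAI-compatible API settings to the .env file in the QAnything root
cd /path/to/QAnything
cat >> .env << 'EOF'
OPENAI_API_KEY=sk-xxxxxxxx
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_MODEL_NAME=gpt-3.5-turbo
OPENAI_API_CONTEXT_LENGTH=4096
EOF

# Then start in cloud mode
bash ./run.sh -c cloud -i 0 -b default
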
| Service startup command | GPUs | LLM Runtime Backend | LLM model |
|---|---|---|---|
| bash ./run.sh -c cloud -i 0 -b default | 1 | OpenAI API | OpenAI API |
| bash ./run.sh -c local -i 0 -b default | 1 | FasterTransformer | Qwen-7B-QAnything |
| bash ./run.sh -c local -i 0 -b hf -m MiniChat-2-3B -t minichat | 1 | Huggingface Transformers | Public LLM (e.g., MiniChat-2-3B) |
| bash ./run.sh -c local -i 0 -b vllm -m MiniChat-2-3B -t minichat -p 1 -r 0.81 | 1 | vllm | Public LLM (e.g., MiniChat-2-3B) |
| bash ./run.sh -c local -i 0,1 -b default | 2 | FasterTransformer | Qwen-7B-QAnything |
| bash ./run.sh -c local -i 0,1 -b hf -m MiniChat-2-3B -t minichat | 2 | Huggingface Transformers | Public LLM (e.g., MiniChat-2-3B) |
| bash ./run.sh -c local -i 0,1 -b vllm -m MiniChat-2-3B -t minichat -p 1 -r 0.81 | 2 | vllm | Public LLM (e.g., MiniChat-2-3B) |
| bash ./run.sh -c local -i 0,1 -b vllm -m MiniChat-2-3B -t minichat -p 2 -r 0.81 | 2 | vllm | Public LLM (e.g., MiniChat-2-3B) |
Note: Choose the startup command that best fits your hardware.
(1) When "-i 0,1" is set, the local embedding/rerank models run on device gpu_id_1; otherwise gpu_id_0 is used by default.
(2) When "-c cloud" is set, the local Embedding/Rerank models plus the OpenAI LLM API are used, requiring only about 4GB of VRAM (suitable for GPU devices with VRAM <= 8GB).
(3) When using the OpenAI LLM API, you will be prompted to enter {OPENAI_API_KEY, OPENAI_API_BASE, OPENAI_API_MODEL_NAME, OPENAI_API_CONTEXT_LENGTH} right away.
(4) "-b hf" is the most recommended way to run public LLM inference, although its performance is lower.
(5) When choosing a public chat LLM for QAnything, consider a PROMPT_TEMPLATE setting appropriate for the specific LLM model.
(6) The list of public LLMs supported by the FastChat API with the Huggingface Transformers/vllm backends is located in "/path/to/QAnything/third_party/FastChat/fastchat/conversation.py".

Public LLMs supported by the FastChat API with the Huggingface Transformers/vllm runtime backends

| model_name | conv_template | Supported Public LLM List |
|---|---|---|
| Qwen-7B-QAnything | qwen-7b-qanything | Qwen-7B-QAnything |
| Qwen-1_8B-Chat/Qwen-7B-Chat/Qwen-14B-Chat | qwen-7b-chat | Qwen |
| Baichuan2-7B-Chat/Baichuan2-13B-Chat | baichuan2-chat | Baichuan2 |
| MiniChat-2-3B | minichat | MiniChat |
| deepseek-llm-7b-chat | deepseek-chat | Deepseek |
| Yi-6B-Chat | Yi-34b-chat | Yi |
| chatglm3-6b | chatglm3 | ChatGLM3 |
| ... | ... | check or add conv_template for more LLMs in "/path/to/QAnything/third_party/FastChat/fastchat/conversation.py" |
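
To see which conv_template names are available in your checkout (per note (6) and the last table row), a plain grep over the FastChat conversation registry works; the grep pattern below is an illustrative guess at how templates are registered and may need adjusting:

# List conversation template names registered in FastChat (pattern is illustrative)
grep -n 'name="' /path/to/QAnything/third_party/FastChat/fastchat/conversation.py | head -n 50
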
2.2.3 Service startup command examples:
  1. Run QAnything on a single GPU with the FastChat API and the Huggingface Transformers runtime backend (recommended for GPU devices with VRAM <= 16GB):

1.1 Run Qwen-7B-QAnything
## Step 1. Download the public LLM model (e.g., Qwen-7B-QAnything) and save to "/path/to/QAnything/assets/custom_models"
## (Optional) Download Qwen-7B-QAnything from ModelScope: https://www.modelscope.cn/models/netease-youdao/Qwen-7B-QAnything
## (Optional) Download Qwen-7B-QAnything from Huggingface: https://huggingface.co/netease-youdao/Qwen-7B-QAnything
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/netease-youdao/Qwen-7B-QAnything

## Step 2. Execute the service startup command. Here we use "-b hf" to specify the Huggingface Transformers backend,
## which loads the model in 8 bits but does bf16 inference by default to save VRAM.
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b hf -m Qwen-7B-QAnything -t qwen-7b-qanything



1.2 Run a public LLM model (e.g., MiniChat-2-3B)
## Step 1. Download the public LLM model (e.g., MiniChat-2-3B) and save to "/path/to/QAnything/assets/custom_models"
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/GeneZC/MiniChat-2-3B

## Step 2. Execute the service startup command. Here we use "-b hf" to specify the Huggingface Transformers backend,
## which loads the model in 8 bits but does bf16 inference by default to save VRAM.
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b hf -m MiniChat-2-3B -t minichat
  2. Run QAnything on a single GPU with the FastChat API and the vllm runtime backend:
2.1 Run Qwen-7B-QAnything
## Step 1. Download the public LLM model (e.g., Qwen-7B-QAnything) and save to "/path/to/QAnything/assets/custom_models"
## (Optional) Download Qwen-7B-QAnything from ModelScope: https://www.modelscope.cn/models/netease-youdao/Qwen-7B-QAnything
## (Optional) Download Qwen-7B-QAnything from Huggingface: https://huggingface.co/netease-youdao/Qwen-7B-QAnything
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/netease-youdao/Qwen-7B-QAnything

## Step 2. Execute the service startup command.  Here we use "-b vllm" to specify the vllm backend.
## Here we use "-b vllm" to specify the vllm backend that will do bf16 inference as default.
## Note you should adjust the gpu_memory_utilization yourself according to the model size to avoid out of memory (e.g., gpu_memory_utilization=0.81 is set default for 7B. Here, gpu_memory_utilization is set to 0.85 by "-r 0.85").
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b vllm -m Qwen-7B-QAnything -t qwen-7b-qanything -p 1 -r 0.85


2.2 Run a public LLM model (e.g., MiniChat-2-3B)
## Step 1. Download the public LLM model (e.g., MiniChat-2-3B) and save to "/path/to/QAnything/assets/custom_models"
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/GeneZC/MiniChat-2-3B

## Step 2. Execute the service startup command. 
## Here we use "-b vllm" to specify the vllm backend that will do bf16 inference as default.
## Note you should adjust the gpu_memory_utilization yourself according to the model size to avoid out of memory (e.g., gpu_memory_utilization=0.81 is set default for 7B. Here, gpu_memory_utilization is set to 0.5 by "-r 0.5").
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b vllm -m MiniChat-2-3B -t minichat -p 1 -r 0.5

## (Optional) Step 2. Execute the service startup command to specify the vllm backend by "-i 0,1 -p 2". It will do faster inference by setting a tensor parallel mode on 2 GPUs.
## bash ./run.sh -c local -i 0,1 -b vllm -m MiniChat-2-3B -t minichat -p 2 -r 0.5
  3. Start QAnything on multiple GPUs with the FastChat API and the vllm backend, with tensor parallel set to 2:
cd /path/to/QAnything
bash ./run.sh -c local -i 0,1 -b vllm -m Qwen-7B-QAnything -t qwen-7b-qanything -p 2 -r 0.85
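
Before using "-i 0,1 -p 2", it is worth confirming that both GPUs are visible and have free VRAM. This is just a plain nvidia-smi check, not part of run.sh:

# Confirm that GPU 0 and GPU 1 are both visible before enabling tensor parallelism
nvidia-smi --query-gpu=index,name,memory.total,memory.free --format=csv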

2.3 Start Using

Front-end page

Once the services are running, open the following address in a browser to try it out.

  • Front-end address: http://your_host:5052/qanything/
API

To access the API, refer to the addresses below (a minimal curl sketch follows this list):

  • API address: http://your_host:8777/api/
  • For detailed API documentation, please refer to the QAnything API documentation
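
A minimal curl sketch against the API above. The endpoint path and JSON fields (local_doc_chat with user_id, kb_ids, question, history) are taken from the QAnything API documentation as I recall it and should be verified against the linked docs before use:

# Ask a question against an existing knowledge base (user_id and kb_id are placeholders)
curl -X POST "http://your_host:8777/api/local_doc_qa/local_doc_chat" \
  -H "Content-Type: application/json" \
  -d '{"user_id": "zzp", "kb_ids": ["your_kb_id"], "question": "What is QAnything?", "history": []}'
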
DEBUG
To inspect the logs, check the log files under the QAnything/logs/debug_logs directory:

| Log file | Description |
|---|---|
| debug.log | User request processing log |
| sanic_api.log | Backend service runtime log |
| llm_embed_rerank_tritonserver.log (single-GPU deployment) | Startup log of the tritonserver serving the LLM, embedding, and rerank models |
| llm_tritonserver.log (multi-GPU deployment) | Startup log of the LLM tritonserver |
| embed_rerank_tritonserver.log (multi-GPU deployment, or when using the OpenAI API) | Startup log of the embedding and rerank tritonserver |
| rerank_server.log | Rerank service runtime log |
| ocr_server.log | OCR service runtime log |
| npm_server.log | Front-end service runtime log |
| llm_server_entrypoint.log | LLM relay service runtime log |
| fastchat_logs/*.log | FastChat service runtime log |
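
To follow a log while reproducing an issue, plain tail commands are enough; the file names are the ones listed in the table above:

# Follow the user request processing log in real time
tail -f /path/to/QAnything/logs/debug_logs/debug.log

# Or watch all debug logs at once
tail -f /path/to/QAnything/logs/debug_logs/*.log
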
Shut down the services:
bash close.sh

2.4 Offline deployment (Linux)

# First, pull the docker images on a machine with internet access
docker pull quay.io/coreos/etcd:v3.5.5
docker pull minio/minio:RELEASE.2023-03-20T20-16-18Z
docker pull milvusdb/milvus:v2.3.4
docker pull mysql:latest
docker pull freeren/qanything:v1.2.1

# Package the images into a single archive
docker save quay.io/coreos/etcd:v3.5.5 minio/minio:RELEASE.2023-03-20T20-16-18Z milvusdb/milvus:v2.3.4 mysql:latest freeren/qanything:v1.2.1 -o qanything_offline.tar

# Download the QAnything code
wget https://github.com/netease-youdao/QAnything/archive/refs/heads/master.zip

# Copy the image archive qanything_offline.tar and the code QAnything-master.zip to the offline machine
cp QAnything-master.zip qanything_offline.tar /path/to/your/offline/machine

# Load the images on the offline machine
docker load -i qanything_offline.tar

# Unzip the code and run
unzip QAnything-master.zip
cd QAnything-master
bash run.sh
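
After docker load, a quick check that all five images made it onto the offline machine before running run.sh:

# Confirm the loaded images are present (the names/tags are the ones pulled above)
docker images | grep -E 'etcd|minio|milvus|mysql|qanything'
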
References

QAnything GitHub: https://github.com/netease-youdao/QAnything
QAnything Gitee: QAnything (Question and Answer based on Anything), a local knowledge-base question-answering system supporting files of any format, installable and usable offline
Qwen GitHub: https://github.com/QwenLM/Qwen (the official repo of Qwen, the Tongyi Qianwen chat and pretrained large language model from Alibaba Cloud)
