1. What is QAnything?
1.1 QAnything
QAnything (Question and Answer based on Anything) is a local knowledge base question-answering system designed to support files of any format as well as databases, and it can be installed and used without an internet connection.
Simply drop in local files of any format and you get accurate, fast and reliable answers.
Currently supported formats: PDF (pdf), Word (docx), PPT (pptx), XLS (xlsx), Markdown (md), email (eml), TXT (txt), images (jpg, jpeg, png), CSV (csv), web links (html); more formats are coming...
1.2 Features
- Data security: the system can be installed and used with the network cable unplugged throughout.
- Cross-language QA: switch freely between Chinese and English questions, regardless of the language of the documents.
- QA over massive data: two-stage retrieval with reranking solves the retrieval degradation problem at large scale; the more data, the better the results.
- High-performance, production-grade system that can be deployed directly in enterprise applications.
- Ease of use: no tedious configuration, one-click installation and deployment, ready to use out of the box.
- Supports question answering over multiple selected knowledge bases.
1.3 Architecture
1.3.1 Why two-stage retrieval?
When the knowledge base is large, the advantage of the two-stage design is very clear. With first-stage embedding retrieval alone, accuracy degrades as the data volume grows (the green line in the figure below); after second-stage reranking, accuracy instead grows steadily with data volume, i.e. the more data, the better the results.
BCEmbedding, the retrieval component used by QAnything, has very strong bilingual and cross-lingual capability and eliminates the Chinese-English gap in semantic retrieval, which yields:
- Strong bilingual and cross-lingual semantic representation [semantic representation metrics on MTEB].
- State-of-the-art results on RAG evaluation based on LlamaIndex [RAG metrics based on LlamaIndex].
1.3.2 First-stage retrieval (embedding)
Model | Retrieval | STS | PairClassification | Classification | Reranking | Clustering | Avg. |
---|---|---|---|---|---|---|---|
bge-base-en-v1.5 | 37.14 | 55.06 | 75.45 | 59.73 | 43.05 | 37.74 | 47.20 |
bge-base-zh-v1.5 | 47.60 | 63.72 | 77.40 | 63.38 | 54.85 | 32.56 | 53.60 |
bge-large-en-v1.5 | 37.15 | 54.09 | 75.00 | 59.24 | 42.68 | 37.32 | 46.82 |
bge-large-zh-v1.5 | 47.54 | 64.73 | 79.14 | 64.19 | 55.88 | 33.26 | 54.21 |
jina-embeddings-v2-base-en | 31.58 | 54.28 | 74.84 | 58.42 | 41.16 | 34.67 | 44.29 |
m3e-base | 46.29 | 63.93 | 71.84 | 64.08 | 52.38 | 37.84 | 53.54 |
m3e-large | 34.85 | 59.74 | 67.69 | 60.07 | 48.99 | 31.62 | 46.78 |
bce-embedding-base_v1 | 57.60 | 65.73 | 74.96 | 69.00 | 57.29 | 38.95 | 59.43 |
- For more detailed results, see the Embedding model benchmark summary.
1.3.3 Second-stage retrieval (rerank)
Model | Reranking | Avg. |
---|---|---|
bge-reranker-base | 57.78 | 57.78 |
bge-reranker-large | 59.69 | 59.69 |
bce-reranker-base_v1 | 60.06 | 60.06 |
- For more detailed results, see the Reranker model benchmark summary.
2. Getting Started
2.1 Prerequisites
2.1.1 For Linux
| System | Required item | Minimum Requirement | Note |
|---|---|---|---|
| Linux amd64 | NVIDIA GPU Memory | >= 4GB (use OpenAI API) | Minimum: GTX 1050Ti (use OpenAI API); Recommended: RTX 3090 |
| | NVIDIA Driver Version | >= 525.105.17 | |
| | Docker version | >= 20.10.5 | Docker install |
| | docker compose version | >= 2.23.3 | docker compose install |
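Assuming the NVIDIA driver, Docker and the compose plugin are already installed, a quick sanity check against the table above could look like the following (standard commands, nothing QAnything-specific):
# Check driver version (should be >= 525.105.17) and available GPU memory
nvidia-smi
# Check Docker and docker compose versions (should be >= 20.10.5 and >= 2.23.3)
docker --version
docker compose version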
2.1.2 For Windows with WSL Ubuntu subsystem
| System | Required item | Minimum Requirement | Note |
|---|---|---|---|
| Windows with WSL Ubuntu subsystem | NVIDIA GPU Memory | >= 4GB (use OpenAI API) | Minimum: GTX 1050Ti (use OpenAI API); Recommended: RTX 3090 |
| | GEFORCE EXPERIENCE | >= 546.33 | GEFORCE EXPERIENCE download |
| | Docker Desktop | >= 4.26.1 (131620) | Docker Desktop for Windows |
| | git-lfs | | git-lfs install |
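For the WSL setup, a similar sanity check can be run inside the WSL Ubuntu shell, assuming Docker Desktop's WSL integration and git-lfs are installed (again, standard commands rather than anything QAnything-specific):
# Inside the WSL Ubuntu shell: the Windows NVIDIA driver should be visible to WSL2
nvidia-smi
# Docker Desktop exposes docker to WSL when WSL integration is enabled
docker --version
# git-lfs should report a version once installed
git lfs version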
2.2 Download and Installation
2.2.1 Download the project
Download the source code:
git clone https://github.com/netease-youdao/QAnything.git
Get the Embedding models:
git clone https://www.modelscope.cn/netease-youdao/QAnything.git
- Download the required Embedding models from Youdao's repository.
- Unpack the downloaded model archive; this produces a folder named "models" containing the required embedding models.
- Place the unpacked "models" folder in the root directory of QAnything.
Download a large language model:
- The "Tongyi Qianwen (Qwen)" large language models are recommended.
- Download the required large language model and place it in the "assets/custom_models/" folder of QAnything.
MiniChat-2-3B
git clone https://www.modelscope.cn/netease-youdao/MiniChat-2-3B.git
Qwen-7B
git clone https://www.modelscope.cn/netease-youdao/Qwen-7B-QAnything.git
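After these downloads, the layout described above can be checked roughly as follows; the exact contents of each cloned model directory depend on the repository, so treat this only as an orientation sketch:
ls /path/to/QAnything
# expected to contain (among others): models/  assets/  run.sh
ls /path/to/QAnything/assets/custom_models
# expected to contain the cloned LLM folders, e.g. MiniChat-2-3B/ or Qwen-7B-QAnything/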
2.2.2 QAnything service startup command usage
Usage:
bash run.sh [-c <llm_api>] [-i <device_id>] [-b <runtime_backend>] [-m <model_name>] [-t <conv_template>] [-p <tensor_parallel>] [-r <gpu_memory_utilization>] [-h]
-c <llm_api>: Specify the LLM API mode; options are {local, cloud}, default 'local'. If '-c cloud' is set, first set the environment variables {OPENAI_API_KEY, OPENAI_API_BASE, OPENAI_API_MODEL_NAME, OPENAI_API_CONTEXT_LENGTH} in the .env file (a minimal example follows this list).
-i <device_id>: Specify the GPU device ID(s).
-b <runtime_backend>: Specify the LLM inference runtime backend; options are {default, hf, vllm}.
-m <model_name>: Specify the name of the public LLM model to load and serve through the FastChat serve API; options include {Qwen-7B-Chat, deepseek-llm-7b-chat, ...}.
-t <conv_template>: Specify the conversation template used with the public LLM model; options include {qwen-7b-chat, deepseek-chat, ...}.
-p <tensor_parallel>: Set the tensor parallel size for the vllm backend; options are {1, 2}, default 1.
-r <gpu_memory_utilization>: Specify the gpu_memory_utilization parameter (0, 1] for the vllm backend; default 0.81.
-h: Show the help message.
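For '-c cloud' mode, the .env entries might look like the following sketch; the variable names come from the option description above, while the values shown are placeholders that must be replaced with your own:
# Example .env entries for cloud mode (placeholder values, replace with your own)
OPENAI_API_KEY=sk-xxxxxxxx
OPENAI_API_BASE=https://api.openai.com/v1
OPENAI_API_MODEL_NAME=gpt-3.5-turbo
OPENAI_API_CONTEXT_LENGTH=4096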
Service startup command | GPUs | LLM Runtime Backend | LLM model |
---|---|---|---|
bash ./run.sh -c cloud -i 0 -b default | 1 | OpenAI API | OpenAI API |
bash ./run.sh -c local -i 0 -b default | 1 | FasterTransformer | Qwen-7B-QAnything |
bash ./run.sh -c local -i 0 -b hf -m MiniChat-2-3B -t minichat | 1 | Huggingface Transformers | Public LLM (e.g., MiniChat-2-3B) |
bash ./run.sh -c local -i 0 -b vllm -m MiniChat-2-3B -t minichat -p 1 -r 0.81 | 1 | vllm | Public LLM (e.g., MiniChat-2-3B) |
bash ./run.sh -c local -i 0,1 -b default | 2 | FasterTransformer | Qwen-7B-QAnything |
bash ./run.sh -c local -i 0,1 -b hf -m MiniChat-2-3B -t minichat | 2 | Huggingface Transformers | Public LLM (e.g., MiniChat-2-3B) |
bash ./run.sh -c local -i 0,1 -b vllm -m MiniChat-2-3B -t minichat -p 1 -r 0.81 | 2 | vllm | Public LLM (e.g., MiniChat-2-3B) |
bash ./run.sh -c local -i 0,1 -b vllm -m MiniChat-2-3B -t minichat -p 2 -r 0.81 | 2 | vllm | Public LLM (e.g., MiniChat-2-3B) |
Note: choose the startup command that best suits your hardware.
(1) When "-i 0,1" is set, the local embedding/rerank models run on GPU gpu_id_1; otherwise gpu_id_0 is used by default.
(2) When "-c cloud" is set, the local Embedding/Rerank models are used together with the OpenAI LLM API, which requires only about 4GB of VRAM (suitable for GPU devices with VRAM <= 8GB).
(3) When using the OpenAI LLM API, you will be asked to enter {OPENAI_API_KEY, OPENAI_API_BASE, OPENAI_API_MODEL_NAME, OPENAI_API_CONTEXT_LENGTH} right away.
(4) "-b hf" is the most broadly compatible, and therefore recommended, way to run public LLM inference, but its performance is relatively poor.
(5) When choosing a public chat LLM for the QAnything system, consider a PROMPT_TEMPLATE setting suited to the specific LLM model.
(6) The list of public LLMs supported via the FastChat API with the Huggingface Transformers/vllm backends is in "/path/to/QAnything/third_party/FastChat/fastchat/conversation.py".
Public LLMs supported via the FastChat API with the Huggingface Transformers/vllm runtime backends
model_name | conv_template | Supported Public LLM List |
---|---|---|
Qwen-7B-QAnything | qwen-7b-qanything | Qwen-7B-QAnything |
Qwen-1_8B-Chat/Qwen-7B-Chat/Qwen-14B-Chat | qwen-7b-chat | Qwen |
Baichuan2-7B-Chat/Baichuan2-13B-Chat | baichuan2-chat | Baichuan2 |
MiniChat-2-3B | minichat | MiniChat |
deepseek-llm-7b-chat | deepseek-chat | Deepseek |
Yi-6B-Chat | Yi-34b-chat | Yi |
chatglm3-6b | chatglm3 | ChatGLM3 |
... check or add conv_template for more LLMs in "/path/to/QAnything/third_party/FastChat/fastchat/conversation.py" |
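To see locally which conversation templates your FastChat checkout defines, a simple grep over that file is usually enough; this assumes a standard FastChat source tree under third_party where templates are registered via register_conv_template:
grep -n "register_conv_template" /path/to/QAnything/third_party/FastChat/fastchat/conversation.py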
2.2.3 Service startup command examples:
- Run QAnything on a single GPU with the FastChat API and the Huggingface Transformers runtime backend:
Recommended for GPU devices with VRAM <= 16GB.
1.1 Run Qwen-7B-QAnything
## Step 1. Download the public LLM model (e.g., Qwen-7B-QAnything) and save to "/path/to/QAnything/assets/custom_models"
## (Optional) Download Qwen-7B-QAnything from ModelScope: https://www.modelscope.cn/models/netease-youdao/Qwen-7B-QAnything
## (Optional) Download Qwen-7B-QAnything from Huggingface: https://huggingface.co/netease-youdao/Qwen-7B-QAnything
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/netease-youdao/Qwen-7B-QAnything
# Step 2. Execute the service startup command. Here we use "-b hf" to specify the Huggingface transformers backend.
## Here we use "-b hf" to specify the transformers backend that will load model in 8 bits but do bf16 inference as default for saving VRAM.
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b hf -m Qwen-7B-QAnything -t qwen-7b-qanything
1.2 Run a public LLM model (e.g., MiniChat-2-3B)
## Step 1. Download the public LLM model (e.g., MiniChat-2-3B) and save to "/path/to/QAnything/assets/custom_models"
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/GeneZC/MiniChat-2-3B
## Step 2. Execute the service startup command. Here we use "-b hf" to specify the Huggingface transformers backend.
## Here we use "-b hf" to specify the transformers backend that will load model in 8 bits but do bf16 inference as default for saving VRAM.
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b hf -m MiniChat-2-3B -t minichat
- Run QAnything on a single GPU with the FastChat API and the vllm runtime backend:
2.1 Run Qwen-7B-QAnything
## Step 1. Download the public LLM model (e.g., Qwen-7B-QAnything) and save to "/path/to/QAnything/assets/custom_models"
## (Optional) Download Qwen-7B-QAnything from ModelScope: https://www.modelscope.cn/models/netease-youdao/Qwen-7B-QAnything
## (Optional) Download Qwen-7B-QAnything from Huggingface: https://huggingface.co/netease-youdao/Qwen-7B-QAnything
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/netease-youdao/Qwen-7B-QAnything
## Step 2. Execute the service startup command. Here we use "-b vllm" to specify the vllm backend.
## Here we use "-b vllm" to specify the vllm backend that will do bf16 inference as default.
## Note you should adjust the gpu_memory_utilization yourself according to the model size to avoid out of memory (e.g., gpu_memory_utilization=0.81 is set default for 7B. Here, gpu_memory_utilization is set to 0.85 by "-r 0.85").
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b vllm -m Qwen-7B-QAnything -t qwen-7b-qanything -p 1 -r 0.85
2.2 Run a public LLM model (e.g., MiniChat-2-3B)
## Step 1. Download the public LLM model (e.g., MiniChat-2-3B) and save to "/path/to/QAnything/assets/custom_models"
cd /path/to/QAnything/assets/custom_models
git clone https://huggingface.co/GeneZC/MiniChat-2-3B
## Step 2. Execute the service startup command.
## Here we use "-b vllm" to specify the vllm backend that will do bf16 inference as default.
## Note you should adjust the gpu_memory_utilization yourself according to the model size to avoid out of memory (e.g., gpu_memory_utilization=0.81 is set default for 7B. Here, gpu_memory_utilization is set to 0.5 by "-r 0.5").
cd /path/to/QAnything
bash ./run.sh -c local -i 0 -b vllm -m MiniChat-2-3B -t minichat -p 1 -r 0.5
## (Optional) Step 2. Execute the service startup command to specify the vllm backend by "-i 0,1 -p 2". It will do faster inference by setting a tensor parallel mode on 2 GPUs.
## bash ./run.sh -c local -i 0,1 -b vllm -m MiniChat-2-3B -t minichat -p 2 -r 0.5
- Start QAnything on multiple GPUs with the FastChat API and the vllm backend, setting the tensor parallel size to 2:
cd /path/to/QAnything
bash ./run.sh -c local -i 0,1 -b vllm -m Qwen-7B-QAnything -t qwen-7b-qanything -p 2 -r 0.85
2.3 Start using
Web frontend
Once the service has started successfully, open the following address in a browser:
- Frontend URL: http://your_host:5052/qanything/
API
To access the API, refer to the following address:
- API address: http://your_host:8777/api/
- For detailed API documentation, please refer to the QAnything API documentation.
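As a quick reachability check (not a substitute for the API documentation, whose concrete endpoints are not listed here), you can query both ports and expect an HTTP status code back:
# Replace your_host with the actual host, e.g. localhost
curl -s -o /dev/null -w "frontend: %{http_code}\n" http://your_host:5052/qanything/
curl -s -o /dev/null -w "api:      %{http_code}\n" http://your_host:8777/api/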
DEBUG
To inspect the logs, check the files under the QAnything/logs/debug_logs directory.
Log file | Description |
---|---|
debug.log | User request processing log |
sanic_api.log | Backend service runtime log |
llm_embed_rerank_tritonserver.log (single-GPU deployment) | Startup log of the Triton server serving the LLM, embedding and rerank models |
llm_tritonserver.log (multi-GPU deployment) | LLM Triton server startup log |
embed_rerank_tritonserver.log (multi-GPU deployment, or when using the OpenAI API) | Embedding and rerank Triton server startup log |
rerank_server.log | Rerank service runtime log |
ocr_server.log | OCR service runtime log |
npm_server.log | Frontend service runtime log |
llm_server_entrypoint.log | LLM relay (proxy) service runtime log |
fastchat_logs/*.log | FastChat service runtime logs |
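To follow a log while reproducing a problem, you can tail the relevant files from the table above (pick the Triton server log that matches your deployment mode):
cd /path/to/QAnything
tail -f logs/debug_logs/debug.log logs/debug_logs/sanic_api.log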
Shut down the services
bash close.sh
2.4 Offline deployment (Linux)
# First, pull the docker images on a machine with internet access
docker pull quay.io/coreos/etcd:v3.5.5
docker pull minio/minio:RELEASE.2023-03-20T20-16-18Z
docker pull milvusdb/milvus:v2.3.4
docker pull mysql:latest
docker pull freeren/qanything:v1.2.1
# Save the images into a single archive
docker save quay.io/coreos/etcd:v3.5.5 minio/minio:RELEASE.2023-03-20T20-16-18Z milvusdb/milvus:v2.3.4 mysql:latest freeren/qanything:v1.2.1 -o qanything_offline.tar
# Download the QAnything source code
wget https://github.com/netease-youdao/QAnything/archive/refs/heads/master.zip
# Copy the image archive qanything_offline.tar and the source archive QAnything-master.zip to the offline machine
cp QAnything-master.zip qanything_offline.tar /path/to/your/offline/machine
# Load the images on the offline machine
docker load -i qanything_offline.tar
# Unpack the code and run the service
unzip QAnything-master.zip
cd QAnything-master
bash run.sh
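Before starting the service on the offline machine, it can be worth confirming that all five images were loaded; this is only a verification step using the image tags pulled earlier:
# All five images saved above should appear in the list
docker images | grep -E 'etcd|minio|milvus|mysql|qanything'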
References
QAnything github: https://github.com/netease-youdao/QAnything
QAnything gitee: QAnything (Question and Answer based on Anything), a local knowledge base question-answering system supporting files of any format as well as databases, installable and usable offline.
Qwen github: https://github.com/QwenLM/Qwen (the official repo of the Qwen / Tongyi Qianwen chat and pretrained large language models by Alibaba Cloud)