Environment setup
- RTX 4090
- Windows 10 Professional
- Install Ubuntu under WSL
- Set WSL to version 2
- Reference: https://blog.csdn.net/SUNbrightness/article/details/116783604
- wsl --list -o lists the available Linux distributions
- wsl --install -d Ubuntu-22.04 installs that Ubuntu release
- You will be prompted to create a username and password; write them down and do not forget them
- Install git
- mobaXterm, for remote connections to the WSL VM
- Install python3-pip inside Ubuntu
- If the install fails, run apt update first to refresh the package lists, then retry; it should then succeed
- Install the NVIDIA driver utilities
- Be sure to run nvidia-smi once first to check whether a driver is already visible; under WSL it may well be, and if so, skip the apt command below
- apt install nvidia-utils-535
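The "check first, install only if missing" advice in the last two bullets can be wrapped in a small guard; a sketch (the nvidia-utils-535 package name is taken from the step above):

```shell
# Only install the userspace utilities when nvidia-smi is absent or broken;
# under WSL2 the Windows host driver is often already exposed to the guest.
check_gpu_driver() {
    if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
        echo "ok"        # driver visible, skip the apt install
    else
        echo "missing"   # would run: sudo apt install nvidia-utils-535
    fi
}
check_gpu_driver
```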
Cloning the code and the models
- Source code: git clone https://github.com/netease-youdao/QAnything.git
- Embedding models: git clone https://www.modelscope.cn/netease-youdao/QAnything.git
- After downloading, unzip models.zip and place the resulting models directory in the QAnything root; this is the embedding model of the RAG framework, used for text representation
- Download the LLM:
- git lfs install
- git clone https://huggingface.co/netease-youdao/Qwen-7B-QAnything
- Note: the ModelScope mirror (git clone https://www.modelscope.cn/netease-youdao/Qwen-7B-QAnything.git) is broken; download the LLM from Hugging Face, otherwise you will hit all kinds of problems. chatglm3-6b ran fine with -b hf, but this ModelScope copy of Qwen did not, so use the Hugging Face copy and nowhere else. I spent several days on this and never worked out why the mirror is bad
- Place the model under QAnything/assets/custom_models
Execution
- Install the required Python packages
- sudo apt update
- sudo apt install python3-pip
- pip install -r requirements.txt
- Check the NVIDIA driver
- Install the NVIDIA driver utilities if needed: apt install nvidia-utils-535
- nvidia-smi shows the GPU status
- Run the bash scripts
- First fix the scripts' line endings (they may carry Windows CRLF endings):
sed -i "s/\r//" scripts/run_for_local_option.sh
sed -i "s/^M//" scripts/run_for_local_option.sh
sed -i "s/\r//" scripts/run_for_cloud_option.sh
sed -i "s/^M//" scripts/run_for_cloud_option.sh
sed -i "s/\r//" run.sh
sed -i "s/^M//" run.sh
sed -i "s/\r//" close.sh
sed -i "s/^M//" close.sh
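The per-file sed calls above can be collapsed into one loop; a sketch assuming the same four script paths (`s/\r$//` strips the trailing carriage return, which also covers the `^M` case):

```shell
# Convert Windows CRLF endings to Unix LF so bash under WSL can run the scripts.
strip_crlf() {
    sed -i 's/\r$//' "$1"
}

for f in scripts/run_for_local_option.sh scripts/run_for_cloud_option.sh run.sh close.sh; do
    if [ -f "$f" ]; then
        strip_crlf "$f"
    fi
done
```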
- Install third-party tools:
- sudo apt-get install jq
- sudo apt-get install bc
- Pin the requests Python package:
- pip install requests==2.28.1
- Install Docker:
sudo apt-get update
sudo apt-get install \
apt-transport-https \
ca-certificates \
curl \
gnupg \
lsb-release
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo \
"deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container-experimental.list | sudo tee /etc/apt/sources.list.d/libnvidia-container-experimental.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo service docker stop
sudo service docker start
sudo apt install docker-compose
# Test docker with a GPU-enabled container
sudo docker run --runtime=nvidia --rm -it --name tensorflow-1.14.0 tensorflow/tensorflow:1.14.0-gpu-py3
- Run the startup script:
- sudo bash ./run.sh -c local -i 0 -b hf -m Qwen-7B-QAnything -t qwen-7b-qanything (this hung at the -b hf step; it turned out the model I had downloaded from ModelScope was broken, so the GPU never finished loading it; switching to chatglm3-6b worked)
- At this point a number of Docker images get pulled; there is a lot to download
- Choose the -b backend carefully; some backends simply may not work. With chatglm3-6b, vllm failed and default failed, but hf worked
Problems and errors:
- Error when running docker:
TypeError: HTTPConnection.request() got an unexpected keyword argument 'chunked'
docker.errors.DockerException: Error while fetching server API version: HTTPConnection.request() got an unexpected keyword argument 'chunked'
- Fix, following https://github.com/google-deepmind/alphafold/issues/812:
- Downgrade the requests library to 2.28.1
- Concretely: pip install requests==2.28.1
- Another error when running docker:
urllib3.exceptions.ProtocolError: ('Connection aborted.', PermissionError(13, 'Permission denied'))
- Running the command with sudo resolves this
- ERROR: for qanything-container-local Cannot start service qanything_local: could not select device driver "nvidia" with capabilities: [[gpu]]
- Probably caused by the NVIDIA driver setup; fixing the NVIDIA driver installation resolves it
- Attempt 1:
- Downloading the driver installer from the NVIDIA site did not work: the Linux 64-bit, Linux aarch64, and FreeBSD x64 packages were all unusable, so that route is a dead end
- After running nvidia-smi, apt suggests some candidate packages; based on those I installed apt install nvidia-utils-535
- After that, nvidia-smi shows the GPU
- Reference: https://blog.csdn.net/m0_73139694/article/details/135473124
- Reference: https://github.com/THUDM/CodeGeeX/issues/103
4. After installing nvidia-utils, nvidia-smi reports: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
- Attempt 1 (did not work):
- sudo add-apt-repository ppa:graphics-drivers/ppa --yes
- sudo apt update
- sudo apt install nvidia-driver-550 (the latest driver at the time)
- Attempt 2:
- sudo apt-get remove --purge '^nvidia-.*'
- sudo apt-get remove --purge '^libnvidia-.*'
- sudo apt-get remove --purge '^cuda-.*'
- wget https://developer.download.nvidia.com/compute/cuda/12.3.2/local_installers/cuda_12.3.2_545.23.08_linux.run
- sudo sh cuda_12.3.2_545.23.08_linux.run
- What finally worked: after enabling the Hyper-V hypervisor features, nvidia-smi suddenly succeeded
- Reference: https://blog.csdn.net/wjinjie/article/details/108997692
- Reference: https://forums.developer.nvidia.com/t/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver-make-sure-that-the-latest-nvidia-driver-is-installed-and-running/197141/2
- Reference: https://forums.developer.nvidia.com/t/nvidia-smi-has-failed-because-it-couldnt-communicate-with-the-nvidia-driver-make-sure-that-the-latest-nvidia-driver-is-installed-and-running/197141
5. docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory')), Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
- The docker daemon had not been started, so the docker service was unavailable
- Workaround tried at the time: install docker a different way, via snap (snap install docker)
5 (continued). The next run then failed with: nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
- Option 1: install the NVIDIA container toolkit:
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo snap restart docker  # docker was installed via snap here; for a normal install restart via systemctl (systemctl restart docker)
docker run --rm -it --gpus all ubuntu:22.04 nvidia-smi  # still failed with the same error at this point
nvidia-smi  # not run
- After redoing the setup following the current install instructions it worked; reference: https://blog.csdn.net/SUNbrightness/article/details/116783604
- Option 2 (did not work):
- Perhaps libnvidia-ml.so.1 is in the wrong place for nvidia-container; nvidia-container-cli list shows the file is under /usr/lib/wsl, so it does exist
- Rename the duplicate copies out of the way, on the theory that clashing file names cause the problem:
mv /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1.bak
mv /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so.1.bak
- Reference: https://www.cnblogs.com/devilmaycry812839668/p/17296525.html
- Option 3 (did not work): install nvidia-docker2
- sudo docker run --rm -it ls -al /usr/lib/x86_64-linux-gnu/libnv* — following this output, symlink the NVIDIA host libraries directly into the corresponding lib directory
- Reference: https://github.com/NVIDIA/nvidia-container-toolkit/issues/289
- Reference: https://docs.nvidia.com/ai-enterprise/deployment-guide-vmware/0.1.0/docker.html
- Option 3 (did not work): try an nvidia/cuda image
- docker run --rm --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi (from https://www.bilibili.com/read/cv31412472/ by titan909)
- Option 4: removing the nvidia options from docker-window.yaml let the startup succeed, but it then fails at rank_embed
- Option 5 (did not work): set nvidia-smi to persistence mode
- Reference: https://github.com/NVIDIA/nvidia-docker/issues/1648
- Option 6:
- Try the approach in https://www.cnblogs.com/dudu/p/18010103
- Option 7: could the snap-installed docker be the cause? Uninstall the snap docker and try again
- Confirmed: the root cause was that docker had been installed via snap
- Docker install guide: https://blog.csdn.net/SUNbrightness/article/details/116783604
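Since the snap-packaged docker turned out to be the root cause, it is worth checking which docker you are actually running before debugging the NVIDIA runtime any further; a small helper sketch:

```shell
# snap-installed docker resolves under /snap and does not pick up the NVIDIA
# container runtime hook; docker-ce from apt resolves to /usr/bin/docker.
docker_origin() {
    case "$(command -v docker 2>/dev/null)" in
        /snap/*) echo "snap" ;;      # remove with: sudo snap remove docker
        "")      echo "none" ;;      # docker not installed
        *)       echo "system" ;;    # apt/docker-ce install
    esac
}
docker_origin
```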
Problem 6: please install FlashAttention https://github.com/Dao-AILab/flash-attention
- pip install ninja
- pip install torch
- pip install flash-attn --no-build-isolation
- OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root (the CUDA toolkit needs to be installed)
- wget https://developer.download.nvidia.com/compute/cuda/12.3.2/local_installers/cuda_12.3.2_545.23.08_linux.run
- sudo sh cuda_12.3.2_545.23.08_linux.run
- export CUDA_HOME=/usr/local/cuda-X.X
- Reference: https://gist.github.com/Brainiarc7/470a57e5c9fc9ab9f9c4e042d5941a40
- Reference: https://blog.csdn.net/OOFFrankDura/article/details/113632416
- Reference: https://github.com/Dao-AILab/flash-attention (an open-source project for GPU-accelerated attention)
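The CUDA_HOME export above can be made concrete; a sketch assuming the 12.3.2 installer put the toolkit in /usr/local/cuda-12.3 (check your actual install path before using it):

```shell
# Point the build tools at the CUDA toolkit; flash-attn's build reads
# CUDA_HOME, and nvcc must be on PATH for the extension to compile.
export CUDA_HOME=/usr/local/cuda-12.3
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}"
# To persist for future shells:
# echo 'export CUDA_HOME=/usr/local/cuda-12.3' >> ~/.bashrc
```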
Problem 7: RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
- Attempt 1:
- pip install ultralytics
- Reference: https://github.com/ultralytics/ultralytics/issues/5793
- This turned out to be only a diagnostic lead; it did not help
- The real cause: the downloaded model was corrupt. I had pulled it from https://www.modelscope.cn; the original model is about 14 GB but my local copy was only 2 GB. Downloading the pytorch weights with a browser instead of git and swapping in the fresh files fixed it
Problem 8: ValueError: Cannot find any model weight files. Please check your (cached) weight path: /model_repos/CustomLLM/Qwen-7B-QAnything
- https://github.com/lm-sys/FastChat/blob/main/fastchat/model/compression.py shows how the weights are looked up; the files are actually there, but a model pulled with a plain git clone is wrong, so re-download it - be careful with this
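A plain git clone without git-lfs leaves small text "pointer" files where the multi-gigabyte weights should be, which produces exactly this kind of missing/undersized-weights symptom; a quick detector sketch (the usage path is hypothetical):

```shell
# A git-lfs pointer file starts with this version line and is ~130 bytes,
# while real .bin/.safetensors weight files are binary and gigabytes in size.
is_lfs_pointer() {
    head -c 40 "$1" 2>/dev/null | grep -q '^version https://git-lfs'
}
# Usage (hypothetical file name):
# is_lfs_pointer Qwen-7B-QAnything/pytorch_model.bin && echo "pointer only - run git lfs pull"
```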
Problem 9: import flash_attn rms_norm fail, import flash_attn rotary fail, import flash_attn fail
- Handled as follows:
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention && pip install .  # building from source is slow; pip install flash-attn --no-build-isolation is faster
# The following are optional and slow to install:
pip install csrc/layer_norm  # very CPU-intensive, be patient
pip install csrc/rotary
- Build error: ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py", line 2096, in _run_ninja_build
    subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
- Workaround: in cpp_extension.py, change -v to --version
Problem 10: /usr/bin/ld: cannot find /mnt/c/QAnything/flash-attention/csrc/rotary/build/temp.linux-x86_64-cpython-310/rotary.o: No such file or directory
- Probably because I installed CUDA 12 and the versions do not match; it needs adjusting
- My attempts to fix this also failed; reference: https://github.com/Dao-AILab/flash-attention/issues/484
Problem 11: run.sh would not start successfully with -b hf or -b vllm
Problem 12: requests.exceptions.ReadTimeout: HTTPConnectionPool(host='0.0.0.0', port=36001): Read timed out. (read timeout=60)
- In practice the answer had already been produced; the timeout only means the service on port 36001 never started
- Port 36001 is started by llm_server_run.sh, which had not come up; the question was why
- The service is a Sanic app, but it never managed to bring up port 36001
- Restarting llm_server_run.sh fixed it
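A read timeout on 0.0.0.0:36001 can be distinguished from a dead service with a quick listener check before restarting anything; a sketch (36001 is the port from the error above):

```shell
# Is anything listening on the LLM service port? If not, the client-side
# ReadTimeout is only a symptom; restart llm_server_run.sh instead.
port_listening() {
    ss -tln 2>/dev/null | grep -q ":$1 "
}

if port_listening 36001; then
    echo "36001 up"
else
    echo "36001 down - restart llm_server_run.sh"
fi
```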
Problem 13: after installing Ubuntu under WSL, the Windows drives were not mounted
- This always worked before; strange that it stopped
- Reconnecting to WSL brought it back; possibly some transient oddity
Problem 14: Failed to install npm dependencies.
- Fixed by installing npm: apt install -y npm
Problem 15: Failed to build the front end.
- At first this looked like an environment problem, but even reinstalling the whole system did not help
- Running npm run build directly inside front_end shows the real error:
file:///mnt/c/QAnything/QAnything/front_end/node_modules/vite/bin/vite.js:7
await import('source-map-support').then((r) => r.default.install())
^^^^^
SyntaxError: Unexpected reserved word
- npm --version reported 8.5; the npm/node shipped by apt is probably too old, so try installing a newer npm instead
- Upgrading npm/node resolved it
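The "Unexpected reserved word" at `await import(...)` means the Node interpreter predates top-level await; apt's node on Ubuntu can be quite old, while recent vite expects Node 18+. A version-gate sketch (the minimum of 18 is an assumption based on vite's requirements):

```shell
# Return success when a `node --version` string meets the minimum major version.
node_ok() {
    version=$1    # e.g. "v16.20.2"
    min_major=$2  # e.g. 18
    major=${version#v}
    major=${major%%.*}
    [ "$major" -ge "$min_major" ]
}
# Usage:
# node_ok "$(node --version)" 18 || echo "upgrade node (e.g. via nvm) before npm run build"
```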
Problem 16: peer closed connection without sending complete message body (incomplete chunked read)
- The full error:
2024-02-26 12:06:58 | ERROR | stderr | ERROR: Exception in ASGI application
2024-02-26 12:06:58 | ERROR | stderr | Traceback (most recent call last):
2024-02-26 12:06:58 | ERROR | stderr |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 261, in __call__
2024-02-26 12:06:58 | ERROR | stderr |     await wrap(partial(self.listen_for_disconnect, receive))
2024-02-26 12:06:58 | ERROR | stderr |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 257, in wrap
2024-02-26 12:06:58 | ERROR | stderr |     await func()
2024-02-26 12:06:58 | ERROR | stderr |   File "/usr/local/lib/python3.10/dist-packages/starlette/responses.py", line 234, in listen_for_disconnect
2024-02-26 12:06:58 | ERROR | stderr |     message = await receive()
2024-02-26 12:06:58 | ERROR | stderr |   File "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 587, in receive
2024-02-26 12:06:58 | ERROR | stderr |     await self.message_event.wait()
2024-02-26 12:06:58 | ERROR | stderr |   File "/usr/lib/python3.10/asyncio/locks.py", line 214, in wait
2024-02-26 12:06:58 | ERROR | stderr |     await fut
2024-02-26 12:06:58 | ERROR | stderr | asyncio.exceptions.CancelledError: Cancelled by cancel scope 7fa254ef6140
- Even when a result was produced there was no actual response, so the answer never appeared
- Attempt: pip install gradio_client==0.2.7, per https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/11855
- After discussing with someone more experienced: the model's output itself may be malformed, so try a different model; minichat-2B was suggested
- Finally understood: the parameters were wrong. The working command is:
- bash ./run.sh -c local -i 0 -b hf -m chatglm3-6b -t chatglm3
References
- https://blog.csdn.net/lianghao118/article/details/136087436
- https://github.com/Dao-AILab/flash-attention
- https://github.com/QwenLM/Qwen#quickstart